[jira] Updated: (HIVE-553) Add BinarySortableSerDe to Hive

Zheng Shao (JIRA) Wed, 08 Jul 2009 20:02:42 -0700

     [ https://issues.apache.org/jira/browse/HIVE-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Zheng Shao updated HIVE-553:
----------------------------

    Attachment: HIVE-553.4.patch

Removed the println.

I also restored the serialization/deserialization of maps. The reason is that 
I think it will take some additional time to work out the LazyBinarySerDe. In 
the meantime, we can just use BinarySortableSerDe for that purpose.



BinarySortableSerDe may not be as compact as LazyBinarySerDe, but it is still 
much better than LazySimpleSerDe, since numerical values are stored in binary 
instead of text format.  In most cases, when all the columns are simple 
columns, BinarySortableSerDe also does not create new objects per row.
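
To illustrate the binary-versus-text point, here is a minimal sketch in Python. 
It is not taken from the patch: the function names are hypothetical, and the 
bit-flipping scheme is one common way to make a fixed-width binary double sort 
correctly as raw bytes, which I am assuming for illustration only. It shows 
that every double takes 8 bytes in binary, while its text form is often longer.

```python
import struct

SIGN_BIT = 0x8000000000000000
ALL_BITS = 0xFFFFFFFFFFFFFFFF


def encode_double_sortable(value: float) -> bytes:
    """Encode a double into 8 bytes whose byte order matches numeric order.

    Hypothetical helper for illustration; not the patch's actual code.
    """
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    if bits & SIGN_BIT:
        bits ^= ALL_BITS   # negative: flip every bit so larger magnitudes sort first
    else:
        bits ^= SIGN_BIT   # non-negative: flip only the sign bit to sort after negatives
    return struct.pack(">Q", bits)


def decode_double_sortable(data: bytes) -> float:
    """Invert encode_double_sortable."""
    (bits,) = struct.unpack(">Q", data)
    if bits & SIGN_BIT:
        bits ^= SIGN_BIT   # the original value was non-negative
    else:
        bits ^= ALL_BITS   # the original value was negative
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


# Sample values, already in ascending numeric order.
values = [-1234.5678, -1.0, 0.0, 3.141592653589793, 1e300]
encoded = [encode_double_sortable(v) for v in values]
text_sizes = [len(repr(v).encode("ascii")) for v in values]

for v, e, t in zip(values, encoded, text_sizes):
    print(f"{v!r}: binary={len(e)} bytes, text={t} bytes")
```

Comparing the encoded byte strings directly gives the same order as comparing 
the original numbers, which is the property a sortable format needs.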

I will benchmark BinarySortableSerDe (as a replacement for LazySimpleSerDe in 
various places) after this gets in. If the performance is already much better 
(most probably), we can replace LazySimpleSerDe with it and deprioritize 
adding LazyBinarySerDe until some time later.

What do you think?


> Add BinarySortableSerDe to Hive
> -------------------------------
>
>                 Key: HIVE-553
>                 URL: https://issues.apache.org/jira/browse/HIVE-553
>             Project: Hadoop Hive
>          Issue Type: New Feature
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: HIVE-553.2.patch, HIVE-553.3.patch, HIVE-553.4.patch
>
>
> Currently the most popular SerDe in Hive is LazySimpleSerDe. LazySimpleSerDe 
> has the benefit of being simple (use text format to store data), but its 
> performance may suffer in the following cases:
> 1. For double values, we are storing them in text format which is very 
> space-inefficient, and both serialization and deserialization are slow;
> 2. For complex-type columns that contain many levels, we are scanning 
> the buffer once per level, which is very inefficient.
> We should add a SerDe that stores the data in binary format. 
> The format should have the following properties:
> 1. Compact: it should be space-efficient;
> 2. Fast: it should be efficient to deserialize the data, especially for 
> double values and complex types;
> 3. It should support serializing NULL values.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
