Zheng Shao commented on PIG-833:
Hive has such a storage access layer. It consists of the Hadoop file format
(which provides row-level access via RecordReader), plus SerDe and
ObjectInspector (which get the column values from the row).
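To make the layering concrete, here is a minimal sketch in Java. It does not use Hive's actual classes: RowDeserializer, RowInspector, DelimitedDeserializer, ArrayInspector and project are hypothetical names invented for illustration, and a plain Iterator<String> stands in for the RecordReader. The point it tries to show is that the caller only ever holds an opaque row and asks an inspector for column values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: these types are hypothetical and are NOT Hive's API.
final class StorageLayerSketch {

    /** SerDe-like role: turns one raw record into an opaque row object. */
    interface RowDeserializer {
        Object deserialize(String rawRecord);
    }

    /** ObjectInspector-like role: knows how to read columns out of a row. */
    interface RowInspector {
        Object getColumn(Object row, int index);
    }

    /** Toy deserializer: a comma-delimited record becomes a String[]. */
    static final class DelimitedDeserializer implements RowDeserializer {
        @Override
        public Object deserialize(String rawRecord) {
            return rawRecord.split(",", -1);
        }
    }

    /** Toy inspector that understands the String[] rows produced above. */
    static final class ArrayInspector implements RowInspector {
        @Override
        public Object getColumn(Object row, int index) {
            return ((String[]) row)[index];
        }
    }

    /** Projects one column; the caller never looks inside the row itself. */
    static List<Object> project(Iterator<String> records, RowDeserializer serde,
                                RowInspector inspector, int column) {
        List<Object> values = new ArrayList<>();
        while (records.hasNext()) {
            Object row = serde.deserialize(records.next());
            values.add(inspector.getColumn(row, column));
        }
        return values;
    }

    public static void main(String[] args) {
        // The iterator plays the part of the file-format layer (RecordReader).
        Iterator<String> records = Arrays.asList("1,alice,30", "2,bob,25").iterator();
        System.out.println(
                project(records, new DelimitedDeserializer(), new ArrayInspector(), 1));
    }
}
```

Because the row is opaque to the caller, a columnar format could hand back a lightweight handle as the row and let its inspector fetch only the requested column.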
Because of that, we were able to integrate RCFile (the columnar storage file
format) fairly easily (NOTE: in the columnar storage case, the row is just a
handle). We did look at TFile at that time, but it was not complete yet so we
implemented RCFile by modifying SequenceFile instead of TFile.
Experiments have shown that Hive's SerDe/ObjectInspector performs well
because we always reuse the same objects for different rows.
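The object-reuse idea can be sketched roughly as follows. This is a simplified illustration, not Hive's code: ReusableRow and ReusingDeserializer are hypothetical names, and only the row holder is reused here (the field strings are still allocated per record).

```java
// Illustrative only: these types are hypothetical and are NOT Hive's API.
final class ReusedRowSketch {

    /** Row holder that is allocated once and refilled for every record. */
    static final class ReusableRow {
        final String[] fields;

        ReusableRow(int width) {
            fields = new String[width];
        }
    }

    /** Deserializer that always returns the same ReusableRow instance. */
    static final class ReusingDeserializer {
        private final ReusableRow row;

        ReusingDeserializer(int width) {
            row = new ReusableRow(width);
        }

        ReusableRow deserialize(String raw) {
            int start = 0;
            for (int i = 0; i < row.fields.length; i++) {
                int end = raw.indexOf(',', start);
                if (end < 0) {
                    end = raw.length();
                }
                // Columns missing from a short record are set to null.
                row.fields[i] = start <= raw.length() ? raw.substring(start, end) : null;
                start = end + 1;
            }
            return row;
        }
    }

    public static void main(String[] args) {
        ReusingDeserializer serde = new ReusingDeserializer(3);
        ReusableRow first = serde.deserialize("1,alice,30");
        String firstName = first.fields[1];   // copy out before the next call
        ReusableRow second = serde.deserialize("2,bob,25");
        System.out.println(first == second);  // same holder, new contents
        System.out.println(firstName + " " + second.fields[1]);
    }
}
```

The trade-off is that a returned row is only valid until the next call, so a caller that needs to keep values across rows has to copy them out first.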
Most of the SerDe/ObjectInspector code is in hive/trunk/serde. There is a set
of slides on http://www.slideshare.net/zshao/hive-object-model
If the Pig team is interested in taking a deeper look at Hive's
SerDe/ObjectInspector infrastructure, I am happy to provide more information.
> Storage access layer
> Key: PIG-833
> URL: https://issues.apache.org/jira/browse/PIG-833
> Project: Pig
> Issue Type: New Feature
> Reporter: Jay Tang
> Attachments: hadoop20.jar.bz2
> A layer is needed to provide a high-level data access abstraction and a
> tabular view of data in Hadoop, which could free Pig users from implementing
> their own data storage/retrieval code. This layer should also include a
> columnar storage format in order to provide fast data projection,
> CPU/space-efficient data serialization, and a schema language to manage
> physical storage metadata. Eventually it could also support predicate
> pushdown for further performance improvement. Initially, this layer could be
> a contrib project in Pig and become a Hadoop subproject later on.