[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792564#action_12792564 ]

Alan Gates commented on PIG-1117:
---------------------------------

There seems to be a lot of code duplication between 
HiveColumnarLoader.setup(String, boolean, String) and 
HiveColumnarLoader.setup(String, boolean).  Could these two functions be 
combined or the common code factored out?
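One common way to factor this out is to have the shorter overload delegate to the longer one. A minimal sketch (method names here are hypothetical, not the actual HiveColumnarLoader code):

```java
public class SetupExample {
    int setupCalls = 0;

    // The two-argument overload delegates to the three-argument one,
    // so the shared logic lives in exactly one place.
    void setup(String table, boolean gzipped) {
        setup(table, gzipped, null);          // delegate; no duplicated body
    }

    void setup(String table, boolean gzipped, String dateRange) {
        setupCalls++;                         // all common work happens here
        // ... parse the table spec, configure compression,
        // and apply the dateRange filter only when it is non-null.
    }
}
```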

Pig doesn't support BOOLEAN and BYTE as external types; we only use them 
internally.  So these should be converted to something else in 
HiveColumnarLoader.findPigDataType.

You may want to implement fieldsToRead, as that allows Pig to tell your loader 
exactly which fields a given query requires, without the user having to 
specify them.
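As a rough sketch (the method name follows the pre-0.7 LoadFunc interface as I understand it, and the field name is hypothetical; treat both as assumptions), the loader would just record the schema Pig hands it and skip deserializing everything else:

```java
// Sketch only: Pig calls this with the subset of fields the query
// actually uses; the loader remembers it and reads only those columns.
public void fieldsToRead(Schema schema) {
    this.requiredFields = schema;   // consult this later in getNext()
}
```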

In HiveColumnarLoader.readRowColumns it is better to use 
TupleFactory.newTuple(int) rather than TupleFactory.newTuple() when you know 
the size of the tuple you'll be creating.  newTuple(int) plus Tuple.set() is 
more efficient than newTuple() plus Tuple.append().
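The same pre-sizing idea is easy to see outside Pig's API. A minimal plain-Java analogy (this is not TupleFactory, just the general pattern it relies on): appending grows the backing storage incrementally, while allocating at the known size up front and setting each slot avoids that growth.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PreSized {
    // Build a row by growing an empty list one append at a time.
    static List<Object> byAppend(Object[] values) {
        List<Object> row = new ArrayList<>();        // capacity unknown up front
        for (Object v : values) row.add(v);          // may trigger re-allocation
        return row;
    }

    // Build a row whose size is known in advance, then set each slot.
    static List<Object> bySet(Object[] values) {
        List<Object> row =
            new ArrayList<>(Arrays.asList(new Object[values.length]));
        for (int i = 0; i < values.length; i++) row.set(i, values[i]);
        return row;
    }
}
```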

svn diff doesn't add jars to patch files, so you'll need to attach the 
hive-exec.jar separately to the jira so that we can run tests.

Also, please be aware that we are rewriting the entire load/store interface, 
and hope to release this soon, probably in 0.7.  See PIG-966 for details.  This 
obviously will affect your code.  Hopefully it will make it much easier, as the 
need to write a separate slicer will go away.


> Pig reading hive columnar rc tables
> -----------------------------------
>
>                 Key: PIG-1117
>                 URL: https://issues.apache.org/jira/browse/PIG-1117
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gerrit Jansen van Vuuren
>         Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
> PIG-1117.patch
>
>
> I've coded a LoadFunc implementation that can read from Hive Columnar RC 
> tables, this is needed for a project that I'm working on because all our data 
> is stored using the Hive thrift serialized Columnar RC format. I have looked 
> at the piggy bank but did not find any implementation that could do this. 
> We've been running it on our cluster for the last week and have worked out 
> most bugs.
>  
> There are still some improvements I would like to make, such as setting the 
> number of mappers based on date partitioning. The loader has been optimized 
> to read only specific columns, and with this improvement it can churn 
> through a data set almost 8 times faster because not all column data is read.
> I would like to contribute the class to the piggybank; can you guide me on 
> what I need to do?
> I've used Hive-specific classes to implement this; is it possible to add 
> these dependencies to the piggybank ivy build for automatic download?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.