[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerrit Jansen van Vuuren updated PIG-1117:
------------------------------------------

    Attachment: PIG-117-v.0.6.0.patch

Patch for Pig version 0.6.0 (should also work for previous versions, at least 
0.5.0).
Contains the following:
 Improved HiveRCLoader with a Slicer that does the slicing correctly based on 
file blocks. The previous version read the whole file and all of its 
associated blocks from one task.
 Refactored to make Byte and Boolean values Integer.
 Refactored to remove code duplication in the setup method of HiveRCLoader.
 build.xml automatically downloads the Hive jars from the Apache website (only 
once, if the Hive dependencies haven't already been downloaded).
 To build the piggybank jar with HiveRCLoader inside, use: ant hive-jar
 
To use it, hive_exec.jar must be available to the Pig jobs, and the piggybank 
jar plus hive_exec.jar must either be registered with the Pig script or be 
available on the classpath.
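As a minimal sketch of the usage described above: assuming the loader is packaged as org.apache.pig.piggybank.storage.HiveColumnarLoader (the class name used in the earlier attachments) and takes the Hive table schema as a constructor string, a Pig script would look roughly like this (the table path and schema are hypothetical):

```pig
-- Register the piggybank jar built with "ant hive-jar" plus the Hive
-- runtime jar; both must be reachable by the Pig jobs (registered here
-- or placed on the classpath).
REGISTER piggybank.jar;
REGISTER hive_exec.jar;

-- Load from a Hive Columnar RC table; the loader reads only the
-- requested columns from the RCFile blocks, which is where the
-- speedup over reading full rows comes from.
rows = LOAD '/user/hive/warehouse/mytable'
       USING org.apache.pig.piggybank.storage.HiveColumnarLoader(
           'uid string, ts int');

DUMP rows;
```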

> Pig reading hive columnar rc tables
> -----------------------------------
>
>                 Key: PIG-1117
>                 URL: https://issues.apache.org/jira/browse/PIG-1117
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gerrit Jansen van Vuuren
>         Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, 
> PIG-1117.patch, PIG-117-v.0.6.0.patch
>
>
> I've coded a LoadFunc implementation that can read from Hive Columnar RC 
> tables. This is needed for a project I'm working on because all our data is 
> stored using the Hive thrift-serialized Columnar RC format. I looked at the 
> piggybank but did not find any implementation that could do this. We've been 
> running it on our cluster for the last week and have worked out most bugs.
>  
> There are still some improvements to be done, like setting the number of 
> mappers based on date partitioning. It's been optimized to read only 
> specific columns, and it can churn through a data set almost 8 times faster 
> because not all column data is read.
> I would like to contribute the class to the piggybank; can you guide me on 
> what I need to do?
> I've used Hive-specific classes to implement this; is it possible to add 
> them to the piggybank ivy build for automatic download of the dependencies?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
