Re: Pig reading hive columnar rc tables

Alan Gates Mon, 30 Nov 2009 14:45:28 -0800


On Nov 30, 2009, at 12:18 PM, Dmitriy Ryaboy wrote:

That's awesome, I've been itching to do that but never got around to it.
Gerrit, do you have any benchmarks on read speeds?
I don't know about putting this in piggybank, as it carries with it pretty
significant dependencies, increasing the size of the jar and making it
difficult for users who don't need it to build piggybank in the first place.
We might want to consider some other contrib for it -- maybe a "misc"
contrib that would have individual ant targets for these kinds of
compatibility submissions?

Does it have to increase the size of the piggybank jar? Instead of including hive in our piggybank jar, which I agree would be bad, can we just say that if you want to use this function you need to provide the appropriate hive jar yourself? This way we could use ivy to pull the jars and build piggybank.

I'm not really wild about creating a new section of contrib just for functions that have heavier weight requirements.


Alan.

-D
On Mon, Nov 30, 2009 at 3:09 PM, Olga Natkovich <ol...@yahoo-inc.com> wrote:
Hi Gerrit,

It would be great if you could contribute the code. The process is
pretty simple:

- Open a JIRA that describes what the loader does and that you would
like to contribute it to the Piggybank.
- Submit the patch that contains the loader. Make sure it has unit tests
and javadoc.
Once this is done, one of the committers will review and commit the patch.
More details on how to contribute are in
http://wiki.apache.org/pig/PiggyBank.

Olga

-----Original Message-----
From: Gerrit van Vuuren [mailto:[email protected]]
Sent: Friday, November 27, 2009 2:42 AM
To: [email protected]
Subject: Pig reading hive columnar rc tables

Hi,
I've coded a LoadFunc implementation that can read from Hive Columnar RC
tables. This is needed for a project that I'm working on because all our
data is stored using the Hive thrift serialized Columnar RC format. I
have looked at the piggy bank but did not find any implementation that
could do this. We've been running it on our cluster for the last week
and have worked out most bugs.



There are still some improvements I would need to make, like
setting the number of mappers based on date partitioning. It's been
optimized so as to read only specific columns and can churn through a
data set almost 8 times faster with this improvement because not all
column data is read.
I would like to contribute the class to the piggybank. Can you guide me
in what I need to do?
I've used Hive-specific classes to implement this. Is it possible to add
this to the piggy bank ivy build for automatic download of the
dependencies?



Thanks,

Gerrit Jansen van Vuuren
