[ https://issues.apache.org/jira/browse/PIG-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gerrit Jansen van Vuuren updated PIG-1117:
------------------------------------------

    Attachment: PIG-1117.patch

This patch contains the following:

Improved HiveColumnarLoader:
-> Implements the Slicer interface, which returns the correct number of slices when date filtering is used.
-> Performance improvement in how columns are read.

TestHiveColumnarLoader:
-> Better testing and improved cleanup.

build.xml:
-> Updated the build.xml file with the following targets: hive-compile, hive-javadoc, hive-jar, hive-test, hive-compile-test.
   These targets do not compile Hive itself; they compile the UDFs that depend on Hive classes, e.g. HiveColumnarLoader.

lib-hivedeps:
-> Contains all of the Hive jars needed by the Hive-dependent UDFs.
-> Currently the only Hive jar needed is hive-exec.jar.

The Hive-dependent UDF source and test source are separated from the rest of the source code. The source directory structure is:

    src/main/java
    src/main/java-hiveudfs
    src/test/java
    src/test/java-hiveudfs

This allows all other UDFs that depend only on Pig to compile without bothering with the Hive-dependent UDFs.

To include all of the UDFs, including the Hive-dependent ones (in this case HiveColumnarLoader), in the final jar, run "ant hive-jar".

Please comment on ideas, and on whether this is an acceptable approach for compiling and testing this class.

Something I noticed while compiling against the newest trunk version of Pig is that the signature of the fieldsToRead method in the LoadFunc interface has changed.

From:
    public void fieldsToRead(Schema schema);
To:
    public RequiredFieldResponse fieldsToRead(RequiredFieldList requiredFieldList) throws FrontendException;

So this source will only compile against versions of Pig from before this change.
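
For readers unfamiliar with the date-filtering idea mentioned above, here is a rough, self-contained sketch of how partition directories might be narrowed to a date range before slices are created. It is not taken from the patch: the class name DatePartitionFilter, the method selectPartitions, and the "daydate=YYYY-MM-DD" directory naming are all assumptions made for illustration, and it uses no Pig or Hive classes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the PIG-1117 patch. */
public class DatePartitionFilter {

    // Assumed partition directory naming, e.g. ".../daydate=2009-10-01".
    private static final String PREFIX = "daydate=";

    /**
     * Returns the directories whose partition date lies in [from, to], inclusive.
     * Directories that do not follow the assumed naming scheme are skipped.
     */
    public static List<String> selectPartitions(List<String> dirs, LocalDate from, LocalDate to) {
        List<String> selected = new ArrayList<>();
        for (String dir : dirs) {
            int idx = dir.lastIndexOf(PREFIX);
            if (idx < 0) {
                continue; // no date partition component in this path
            }
            String datePart = dir.substring(idx + PREFIX.length());
            int slash = datePart.indexOf('/');
            if (slash >= 0) {
                datePart = datePart.substring(0, slash); // drop anything below the partition dir
            }
            try {
                LocalDate date = LocalDate.parse(datePart); // expects ISO format yyyy-MM-dd
                if (!date.isBefore(from) && !date.isAfter(to)) {
                    selected.add(dir);
                }
            } catch (DateTimeParseException e) {
                // Value after the prefix is not a valid date; ignore this directory.
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList(
                "/warehouse/logs/daydate=2009-10-01",
                "/warehouse/logs/daydate=2009-10-02",
                "/warehouse/logs/daydate=2009-10-03",
                "/warehouse/logs/_tmp");
        List<String> selected = selectPartitions(
                dirs, LocalDate.parse("2009-10-02"), LocalDate.parse("2009-10-03"));
        // Only the selected directories would be turned into slices.
        System.out.println(selected);
    }
}
```

In a sketch like this, the number of slices follows from the directories that survive the filter rather than from every file under the table, which is the effect the Slicer change is described as having.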
> Pig reading hive columnar rc tables
> -----------------------------------
>
>                 Key: PIG-1117
>                 URL: https://issues.apache.org/jira/browse/PIG-1117
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gerrit Jansen van Vuuren
>         Attachments: HiveColumnarLoader.patch, HiveColumnarLoaderTest.patch, PIG-1117.patch
>
>
> I've coded a LoadFunc implementation that can read from Hive Columnar RC
> tables. This is needed for a project that I'm working on, because all our data
> is stored using the Hive thrift serialized Columnar RC format. I have looked
> at the piggybank but did not find any implementation that could do this.
> We've been running it on our cluster for the last week and have worked out
> most bugs.
>
> There are still some improvements I would like to make, such as setting
> the number of mappers based on date partitioning. It's been optimized to
> read only specific columns, and can churn through a data set almost 8 times
> faster with this improvement because not all column data is read.
> I would like to contribute the class to the piggybank. Can you guide me in
> what I need to do?
> I've used Hive-specific classes to implement this. Is it possible to add this
> to the piggybank build ivy for automatic download of the dependencies?