Pig on Spark - Suggestions in handling code changes out of Spark

Praveen R Tue, 30 Sep 2014 07:07:01 -0700

*Hi Everyone,*

Earlier we made some changes on
https://github.com/sigmoidanalytics/spork/tree/spork-pig-12 to achieve
complete e2e coverage, but we did not restrict ourselves from making changes
in the Pig codebase itself, as we found that slightly easier to do.


We are now working on merging these changes into
https://github.com/apache/pig/tree/spark and have had to revisit them,
to either find a workaround or propose each change on trunk.

Below is the gist of the code changes made outside the Spark-specific code;
the related code can be found here <http://goo.gl/nRgldU>


   1. Had to comment out PigStatsUtil.addNativeJobStats(PigStats.get(), this,
      true); to get native (mapred) operator working
   2. Changes in PigRecordReader to identify endOfAllInput
   3. POUserFunc - made properties attribute public
   4. POCollectedGroup - getNextTuple modified to identify the end of all input
   5. POFRJoin - made LRs attribute public to use it during FR join
   6. POMergeJoin - made LRs attribute public to use it during merge join
   7. POStream - problem with identifying endOfAllInput, made some changes
   8. JsonLoader - made properties public to use from JsonStorage
   9. JsonStorage - uses properties from JsonLoader
   10. PigStorage - mRequiredColumns attribute
   11. BinSedesTuple, BinSedesTupleFactory - made the class serializable
   12. SchemaTupleBackend - changes to initialize stbInstance when null



I would like to seek suggestions up front before I submit the related patches
and take up the discussion on an issue-by-issue basis.

BTW, below are the JIRA issues relating to the above changes, which I will be
working on. Anyone interested in taking any of them up, please feel free to
comment on the issue.

PIG-4193, PIG-4189, PIG-4190, PIG-4192, PIG-4200, PIG-4207, PIG-4208,
PIG-4209

Thanks,
Praveen R
