[Pig Wiki] Update of "ProposedProjects" by OlgaN

Apache Wiki Wed, 29 Apr 2009 18:11:41 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ProposedProjects

------------------------------------------------------------------------------
  || Optimization || In many cases data to be joined is already sorted and 
partitioned on the same key.  Pig needs to be able to take advantage of this 
and do these joins in the map.  The join could be done by sampling one input to 
determine the value of the join key at the beginning of every HDFS block.  This 
would form an index.  A second MR job can then be run with the other input.  
Based on the key seen in the second input, the appropriate blocks of the first 
input can also be loaded into the map and the join done. || || gates || ||
  || Optimization || The combiner is not currently used if FILTER is in the 
FOREACH.  In some cases it could still be used.  || 
[https://issues.apache.org/jira/browse/PIG-479 479] || olgan || ||
  || Optimization || Currently, when data types are declared, Pig inserts a 
FOREACH immediately after the LOAD that does the conversions.  These 
conversions should be delayed until the field is actually used. || 
[https://issues.apache.org/jira/browse/PIG-410 410] || olgan || gates ||
- || Optimization || When an order by is not the only operation in a pig 
script, it is done in two additional MR jobs.  The first job samples using a 
sampling loader, the second does the sort.  The sample is used to construct a 
partitioner that equally balances the data in the sort.  The sampler needs to 
be changed to be a !EvalFunc instead of a loader.  This way a split can be put 
in the preceding MR job, with the main data being written out and the other 
part flowing to the sampler func, which can then write out the sample.  The 
final MR job can then be the sort. || || gates || ||
+ || Optimization || When an order by is not the only operation in a pig 
script, it is done in two additional MR jobs.  The first job samples using a 
sampling loader, the second does the sort.  The sample is used to construct a 
partitioner that equally balances the data in the sort.  The sampler needs to 
be changed to be a !EvalFunc instead of a loader.  This way a split can be put 
in the preceding MR job, with the main data being written out and the other 
part flowing to the sampler func, which can then write out the sample.  The 
final MR job can then be the sort. || 
[https://issues.apache.org/jira/browse/PIG-791 791] || gates || olgan ||
  || Optimization || When an order by is the only operation in a pig script it 
is currently done in 3 MR jobs.  The first converts the input to BinStorage format 
(because the sample loader reads that format), the second samples, and the 
third sorts.  Once the changes mentioned above to make the sampler an !EvalFunc 
are done it should be changed to be done in 2 MR jobs instead of 3. || 
[https://issues.apache.org/jira/browse/PIG-460 460] || gates || ||
  || Optimization || The Pig optimizer should be used to determine when fields 
in a record are no longer needed and put in FOREACH statements to project out 
the unnecessary data as early as possible. || 
[https://issues.apache.org/jira/browse/PIG-466 466] || olgan || ||
  || Optimization || The Pig optimizer needs to call fieldsToRead so that Load 
functions that can do column skipping do it. || || gates || ||
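
The map-side join proposed in the first row of the table (index the first key of every block of one sorted input, then load only the blocks that can hold a key seen in the other input) could be sketched roughly as follows. This is an illustration in Python, not Pig code: in-memory lists stand in for HDFS blocks, every block is assumed to be non-empty, and the names build_block_index, blocks_for_key and map_side_join are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_block_index(blocks):
    """Return the first join key of every block; this list is the sparse index.

    Assumes every block is a non-empty list of (key, value) pairs and that
    the blocks together are sorted on key.
    """
    return [block[0][0] for block in blocks]


def blocks_for_key(index, key):
    """Return the range of block numbers that may contain `key`."""
    # Rows with this key can start in the block just before the first
    # block whose first key is >= key, so step back by one.
    first = max(bisect_left(index, key) - 1, 0)
    # Blocks whose first key is greater than `key` cannot contain it.
    last = bisect_right(index, key)
    return range(first, last)


def map_side_join(left_blocks, right_rows):
    """Join right_rows against left_blocks, reading only candidate blocks."""
    index = build_block_index(left_blocks)
    for key, right_value in right_rows:
        for block_no in blocks_for_key(index, key):
            for left_key, left_value in left_blocks[block_no]:
                if left_key == key:
                    yield (key, left_value, right_value)
```

In this sketch the index is built in one pass (the sampling job in the proposal) and the join itself needs no reduce phase, because each key from the second input is matched only against the few blocks the index points to.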

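The order-by rows above rely on a sample to build a partitioner that balances the sort. A minimal sketch of that idea, again in Python rather than Pig's own code and with hypothetical names (boundaries_from_sample, partition_for, range_partition_sort), assuming a non-empty sample that is representative of the key distribution:

```python
from bisect import bisect_right


def boundaries_from_sample(sample, num_partitions):
    """Pick num_partitions - 1 cut points at evenly spaced quantiles."""
    ordered = sorted(sample)
    return [
        ordered[(i * len(ordered)) // num_partitions]
        for i in range(1, num_partitions)
    ]


def partition_for(key, boundaries):
    """Return the partition number for `key`; keys equal to a cut point go right."""
    return bisect_right(boundaries, key)


def range_partition_sort(keys, sample, num_partitions):
    """Route keys to range partitions, then sort each partition on its own.

    Because partition i only holds keys smaller than those in partition
    i + 1, concatenating the sorted partitions gives a total order.
    """
    boundaries = boundaries_from_sample(sample, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_for(key, boundaries)].append(key)
    return [sorted(part) for part in partitions]
```

Here the sample plays the role of the output of the sampling job (or of the proposed sampler func), and each inner list corresponds to the work of one reducer in the final sort job.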