[Pig Wiki] Update of "ProposedProjects" by AlanGates

Apache Wiki Tue, 26 May 2009 18:05:59 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by AlanGates:
http://wiki.apache.org/pig/ProposedProjects

------------------------------------------------------------------------------
  
  || Category || Project || JIRA || References || Proposed By || Votes For ||
  || Execution || Pig currently executes scripts by building a pipeline of 
pre-built operators and running data through those operators in map reduce 
jobs.  We need to investigate instead having Pig generate Java code specific to a 
job, and then compiling that code and using it to run the map reduce jobs. || 
|| || Many conference attendees || gates ||
- || Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed 
inside FOREACH.  All operators should be allowed in FOREACH. (Limit is being 
worked on [https://issues.apache.org/jira/browse/PIG-741 741] || || || gates || 
||
+ || Language || Currently only LIMIT, DISTINCT, ORDER BY, and FILTER are 
allowed inside FOREACH.  All operators should be allowed in FOREACH. || || || 
gates || ||
  || Optimization || Speed up comparison of tuples during shuffle for ORDER BY 
|| [https://issues.apache.org/jira/browse/PIG-659 659] || || olgan || ||
  || Optimization || Order by should be changed to not use POPackage to put all 
of the tuples in a bag on the reduce side, as the bag is just immediately 
flattened.  It can instead work like join does for the last input in the join. 
|| [https://issues.apache.org/jira/browse/PIG-802 802] || || gates || olgan ||
  || Optimization || Often in a Pig script that produces a chain of MR jobs, 
the map phases of the 2nd and subsequent jobs do very little.  What little they do 
should be pushed into the preceding reduce and the map replaced by the 
identity mapper.  Initial tests showed that the identity mapper was 50% faster 
than using a Pig mapper (because Pig uses the loader to parse out tuples even 
if the map itself is empty). || [https://issues.apache.org/jira/browse/PIG-480 
480] || || olgan || gates ||
  || Optimization || Use hand crafted calls to do string to integer or float 
conversions.  Initial tests showed these could be done about 8x faster than 
String.toInteger() and String.toFloat(). || 
[https://issues.apache.org/jira/browse/PIG-482 482] || || olgan || gates ||
- || Optimization || Currently Pig always samples for and ORDER BY to determine 
how to partition, and then runs another job to do the sort.  For small enough 
inputs, it should just sort with a single reducer. || 
[https://issues.apache.org/jira/browse/PIG-483 483] || || olgan || ||
+ || Optimization || Currently Pig always samples for an ORDER BY to determine 
how to partition, and then runs another job to do the sort.  For small enough 
inputs, it should just sort with a single reducer. || 
[https://issues.apache.org/jira/browse/PIG-483 483] || || olgan || ||
- || Optimization || In many cases data to be joined is already sorted and 
partitioned on the same key.  Pig needs to be able to take advantage of this 
and do these joins in the map.  The join could be done by sampling one input to 
determine the value of the join key at the beginning of every HDFS block.  This 
would form an index.  Then in a second MR job can be run with the other input.  
Based on the key seen in the second input, the appropriate blocks of the first 
input can also be loaded into the map and the join done. || || || gates || ||
+ || Optimization || In many cases data to be joined is already sorted and 
partitioned on the same key.  Pig needs to be able to take advantage of this 
and do these joins in the map.  See http://wiki.apache.org/pig/PigMergeJoin for 
a proposal on how to do this. || || || gates || ||
  || Optimization || The combiner is not currently used if FILTER is in the 
FOREACH.  In some cases it could still be used.  || 
[https://issues.apache.org/jira/browse/PIG-479 479] || || olgan || ||
+ || Optimization || The combiner is not currently used if LIMIT is in the 
FOREACH.  ||  || || gates || ||
  || Optimization || Currently when types of data are declared Pig inserts a 
FOREACH immediately after the LOAD that does the conversions.  These 
conversions should be delayed until the field is actually used. || 
[https://issues.apache.org/jira/browse/PIG-410 410] || || olgan || gates ||
- || Optimization || When an order by is not the only operation in a pig 
script, it is done in two additional MR jobs.  The first job samples using a 
sampling loader, the second does the sort.  The sample is used to construct a 
partitioner that equally balances the data in the sort.  The sampler needs to 
be changed to be a !EvalFunc instead of a loader.  This way a split can be but 
in the proceeding MR job, with the main data being written out and the other 
part flowing to the sampler func, which can then write out the sample.  The 
final MR job can then be the sort. || 
[https://issues.apache.org/jira/browse/PIG-791 791] || || gates || olgan ||
- || Optimization || When an order by is the only operation in a pig script it 
is currently done in 3 MR jobs.  The first converts it to BinStorage format 
(because the sample loader reads that format), the second samples, and the 
third sorts.  Once the changes mentioned above to make the sampler an !EvalFunc 
are done it should be changed to be done in 2 MR jobs instead of 3. || 
[https://issues.apache.org/jira/browse/PIG-460  460] || || gates || ||
  || Optimization || The Pig optimizer should be used to determine when fields 
in a record are no longer needed and put in FOREACH statements to project out 
the unnecessary data as early as possible. || 
[https://issues.apache.org/jira/browse/PIG-466 466] || || olgan || ||
  || Optimization || The Pig optimizer needs to call fieldsToRead so that Load 
functions that can do column skipping do it. || || || gates || ||
- || Scalability || Pig's default join (symmetric hash) currently depends on 
being able to fit all of the values for a given join key for one of the inputs 
into memory.  (It does try to spill to disk in the case where it cannot fit 
them all into memory.  In practice this often fails as it is not good at 
understanding when memory is low enough that it should spill.  Even in the case 
where it does not fail, spilling to disk and rereading from disk is very slow.) 
 Instances of keys with a large number of values could instead be broken up so 
that each row set fits in memory and then shipped to multiple reducers.  A sampling 
pass would need to be done first to determine which keys to break up. || || || 
chris olston || gates ||
+ || Scalability || Pig's default join (symmetric hash) currently depends on 
being able to fit all of the values for a given join key for one of the inputs 
into memory.  (It does try to spill to disk in the case where it cannot fit 
them all into memory.  In practice this often fails as it is not good at 
understanding when memory is low enough that it should spill.  Even in the case 
where it does not fail, spilling to disk and rereading from disk is very slow.) 
 Instances of keys with a large number of values could instead be broken up so 
that each row set fits in memory and then shipped to multiple reducers.  A sampling 
pass would need to be done first to determine which keys to break up.  See 
http://wiki.apache.org/pig/PigSkewedJoinSpec || || || chris olston || gates ||
- || Scalability || Improve memory footprint for a tuple || 
[https://issues.apache.org/jira/browse/PIG-793 793] || || olgan || ||
+ || Scalability || Improve memory footprint for a tuple.  See 
http://wiki.apache.org/pig/PigMemory || 
[https://issues.apache.org/jira/browse/PIG-793 793] || || olgan || ||
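
The PIG-482 row above proposes hand crafted string to integer conversions. As a rough illustration only, a minimal sketch of such a conversion in Java might look like the following. The class name FastIntParser and its parseInt method are hypothetical and are not part of Pig. The sketch assumes plain base-10 input with an optional sign, and it makes no claim about reproducing the speedup quoted in the table.

```java
// Hypothetical helper, not part of Pig: a hand-written base-10 int parser.
final class FastIntParser {
    private FastIntParser() {}

    /** Parses an optionally signed base-10 int, or throws NumberFormatException. */
    static int parseInt(CharSequence s) {
        if (s == null || s.length() == 0) {
            throw new NumberFormatException("empty input");
        }
        int len = s.length();
        int i = 0;
        boolean negative = false;
        char first = s.charAt(0);
        if (first == '-' || first == '+') {
            negative = (first == '-');
            i = 1;
            if (len == 1) {
                throw new NumberFormatException("sign without digits");
            }
        }
        // Accumulate as a negative value so Integer.MIN_VALUE is representable.
        int limit = negative ? Integer.MIN_VALUE : -Integer.MAX_VALUE;
        int multMin = limit / 10;
        int result = 0;
        for (; i < len; i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit: " + s.charAt(i));
            }
            // Check for overflow before multiplying and before subtracting.
            if (result < multMin) {
                throw new NumberFormatException("overflow");
            }
            result *= 10;
            if (result < limit + digit) {
                throw new NumberFormatException("overflow");
            }
            result -= digit;
        }
        return negative ? result : -result;
    }
}
```

Taking a CharSequence rather than a String is a design choice of this sketch, so that a caller could wrap a byte or char buffer without first building a String. Whether that matters in practice would need to be measured.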
