Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=1&rev2=2

--------------------------------------------------

  '''Estimated Development Effort:''' small
+ ==== Combiner Not Used with Limit or Filter ====
+ Pig scripts that have a foreach with a nested limit or filter do not use the combiner even when they could.  Not all filters can use the combiner, but in some cases
+ they can.  I think all limits could at least apply the limit in the combiner, though the UDF itself may only be executed in the reducer.
+
+ '''Category:''' Performance
+
+ '''Dependency:''' Map Reduce Optimizer
+
+ '''References:''' [[https://issues.apache.org/jira/browse/PIG-479|PIG-479]]
+
+ '''Estimated Development Effort:''' small
+
  ==== Clean Up File Access Packages ====
  Early on Pig sought to be completely Hadoop independent in its front end processing (parsing, logical plan, optimizer).  To this end a number of abstractions were created for file access, which are located in the org.apache.pig.backend.datastorage package.  Now that we have modified
@@ -189, +201 @@

  '''References:'''

  '''Estimated Development Effort:''' large
+
+ ==== Order By for Small Data ====
+ Currently Pig always samples the data for an order by and splits it across multiple machines.  In cases where the data to be ordered is small enough to fit on a
+ single node, the sample stage should be eliminated and the sorting done by an identity mapper plus a reduce job.
+
+ '''Category:''' Performance
+
+ '''Dependency:'''
+
+ '''References:''' [[https://issues.apache.org/jira/browse/PIG-483|PIG-483]]
+
+ '''Estimated Development Effort:''' small

  ==== Outer Join for Merge Join ====
  Merge join is the last join type to not support outer join.  Right outer join is doable in the current infrastructure.
  Left and full outer join will require an
@@ -445, +469 @@

  '''Estimated Development Effort:''' depends on what type of integration is chosen
+
+ ==== Physical Operators Take List of Tuples ====
+ Currently tuples are passed one at a time between physical operators.  Moving all the way through the pipeline for each tuple causes a lot of context switching.  We
+ need to investigate batching tuples and passing a list between operators instead.  In the map phase this would be likely to help, though we would want to
+ re-implement our map task to take control from Map Reduce so that we get multiple records at a time.  In reduce it is less clear, since tuples in reduce
+ tend to be large (since they already contain the group), and thus batching them may cause memory problems.
+
+ '''Category:''' Performance
+
+ '''Dependency:'''
+
+ '''References:''' [[https://issues.apache.org/jira/browse/PIG-688|PIG-688]]
+
+ '''Estimated Development Effort:''' medium (involves a rewrite of many physical operators)
+
+
  ==== Run Map Reduce Jobs Directly From Pig ====
  It would be very useful to be able to run arbitrary Map Reduce jobs from inside Pig.  This would look something like:
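  The page's own example does not appear in this excerpt.  As a purely illustrative sketch (the jar name, class name, and paths below are hypothetical, and this is
  not necessarily the syntax the page proposed), the native MAPREDUCE operator that Pig later gained works along these lines:

  {{{
  A = load 'pages' as (url:chararray, content:chararray);
  -- hand A to an arbitrary Map Reduce job packaged in a jar, then load that job's output
  -- ('wordcount.jar', org.myorg.WordCount, and the two directories are hypothetical names
  -- used only for illustration)
  B = mapreduce 'wordcount.jar'
          store A into 'mrInput'
          load 'mrOutput' as (word:chararray, count:int)
          `org.myorg.WordCount mrInput mrOutput`;
  dump B;
  }}}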