[Pig Wiki] Update of "PigJournal" by AlanGates

Apache Wiki Tue, 01 Jun 2010 12:01:49 -0700

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=5&rev2=6

--------------------------------------------------

  || Multiquery support                                   || 0.3                || ||
  || Add skewed join                                      || 0.4                || ||
  || Add merge join                                       || 0.4                || ||
+ || Add Zebra as contrib project                         || 0.4                || ||
  || Support Hadoop 0.20                                  || 0.5                || ||
  || Improved Sampling                                    || 0.6                || There is still room for improvement for order by sampling ||
  || Change bags to spill after reaching fixed size       || 0.6                || Also created bag backed by Hadoop iterator for single UDF cases ||
@@ -32, +33 @@

  || Switch local mode to Hadoop local mode               || 0.6                || ||
  || Outer join for default, fragment-replicate, skewed   || 0.6                || ||
  || Make configuration available to UDFs                 || 0.6                || ||
+ || Load Store Redesign                                  || 0.7                || ||
+ || Add Owl as contrib project                           || not yet released   || ||
+ || Pig Mix 2.0                                          || not yet released   || ||
  
  == Work in Progress ==
  This covers work that is currently being done.  For each entry the main JIRA 
for the work is referenced.
  
- || Feature                                                  || JIRA                                                       || Comments ||
+ || Feature                                  || JIRA                                                         || Comments ||
- || Metadata                                                 || [[http://issues.apache.org/jira/browse/PIG-823|PIG-823]]   || ||
+ || Boolean Type                             || [[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
- || Query Optimizer                                          || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || ||
+ || Query Optimizer                          || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]]   || ||
- || Load Store Redesign                                      || [[http://issues.apache.org/jira/browse/PIG-966|PIG-966]]   || ||
- || Add SQL Support                                          || [[http://issues.apache.org/jira/browse/PIG-824|PIG-824]]   || ||
- || Change Pig internal representation of chararray to Text  || [[http://issues.apache.org/jira/browse/PIG-1017|PIG-1017]] || Patch ready; unclear when to commit to minimize disruption to users and destabilization of the code base. ||
- || Integration with Zebra                                   || [[http://issues.apache.org/jira/browse/PIG-833|PIG-833]]   || ||
+ || Cleanup of javadocs                      || [[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || ||
+ || UDFs in scripting languages              || [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]   || ||
+ || Ability to specify a custom partitioner  || [[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]   || ||
+ || Pig usage stats collection               || [[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], [[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], [[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], [[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || ||
+ || Make Pig available via Maven             || [[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || ||
  
  
  == Proposed Future Work ==
@@ -68, +73 @@

  Within each subsection order is alphabetical and does not imply priority.
  
  === Agreed Work, Agreed Approach ===
- ==== Boolean Type ====
- Boolean is currently supported internally as a type in Pig, but it is not 
exposed to users.  Data cannot be of type boolean, nor can UDFs (other than
- !FilterFuncs) return boolean.  Users have repeatedly requested that boolean 
be made a full type.
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''  Will affect all !LoadCasters, as they will have to provide 
byteToBoolean methods.
- 
- '''References:'''
- 
- '''Estimated Development Effort:'''  small
- 
  ==== Combiner Not Used with Limit or Filter ====
  Pig Scripts that have a foreach with a nested limit or filter do not use the 
combiner even when they could.  Not all filters can use the combiner, but in 
some cases
  they can.  I think all limits could at least apply the limit in the combiner, 
though the UDF itself may only be executed in the reducer. 
@@ -226, +219 @@

  
  '''Estimated Development Effort:'''  small
  
- ==== Pig Mix 2.0 ====
- Pig Mix has been a very useful tool for Pig to test performance from version 
to version and to communicate the results of those tests to users.  However, it 
was
- developed prior to release 0.3, and does not test any functionality included 
with 0.4 or later.  Also the current
- Pig Mix tests only latency and not scalability.  A new version of Pig Mix is 
needed that tests additional Pig
- functionality such as outer joins, new join implementations, use of the 
accumulator interface, etc.  Scalability tests also need to be
- added to Pig Mix 2.0, or a separate scalability benchmark developed, so that 
Pig developers can measure Pig's scalability as changes are
- made.
- 
- '''Category:'''  Development
- 
- '''Dependency:'''
- 
- '''References:''' [[http://wiki.apache.org/pig/PigMix|Pig Mix]]
- 
- '''Estimated Development Effort:'''  medium
- 
  ==== Pig Server ====
  Currently Pig runs as a "fat client" where all of the front end processing is 
done on the user's machine.  This has the advantage that it requires no
  installation and no maintenance of a server.  However, it has the drawback 
that upgrades require upgrading every client machine, users may be using 
@@ -286, +263 @@

  
  '''Estimated Development Effort:'''  medium
  
- ==== UDF Support in Other Languages ====
- Currently Pig users must implement UDFs in Java.  We would like to extend 
this to allow !EvalFuncs and !FilterFuncs to be implemented in scripting 
languages.
- There seems to be consensus that implementing this in one of the frameworks 
that compiles scripting languages down to Java bytecode would be simpler than
- supporting any number of languages and also would provide sufficient 
scripting support.  Specifically, Python, Ruby, and Groovy can all be supported 
in this
- manner, though Perl and C cannot.  Which framework to use for this is not 
clear.
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''
- 
- '''References:'''  [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]
- 
- '''Estimated Development Effort:'''  medium
- 
  === Agreed Work, Unknown Approach ===
- 
  ==== Clarify Pig Latin Semantics ====
  There are areas of Pig Latin semantics that are not clear or not consistent.  
Take for example, a script like:
  
@@ -383, +345 @@

  
  '''Estimated Development Effort:'''  depends on how much SQL we decide to 
implement
  
- ==== Standard UDFs Should Pig Provide ====
+ ==== Standard UDFs Pig Should Provide ====
  There are a number of UDFs in Piggybank that might be considered standard, 
such as UPPER, LOWER, etc.  These could be moved into Pig proper so that they 
are better tested
  and maintained.  Also the Pig team should consider what additional UDFs 
should be added as standard.  Categories for consideration
  include string functions, math functions, statistics functions, date and time 
functions.  We should also consider if there are
@@ -412, +374 @@

  '''Estimated Development Effort:'''  medium
  
  
- ==== Statistics on Usage ====
- It would be very useful for Pig developers if Pig collected statistics of how 
users used Pig.  This could include what scripts were run, basic 
characteristics of
- the data, etc.  Note that this is separate from collecting statistics about 
data for the optimizer, though the two may share some functionality.  Also, 
this
- raises security concerns (who gets to see who ran what) and thus will have to 
be configurable from site to site.  This has been placed in the unknown approach
- section because no design of how to collect statistics, where to store them, 
etc. has been proposed.
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''
- 
- '''References:'''
- 
- '''Estimated Development Effort:'''  medium
- 
  === Experimental ===
+ ==== Add Scalars To Pig Latin ====
+ Users have repeatedly requested the ability to do something like this:
- ==== Custom Partitioner ====
- Hadoop allows !MapReduce users to set a custom partitioner between Map and 
Reduce phases.  Users would like to use these
- partitioners in their Pig scripts.  In some situations Pig sets its own 
custom partitioner (order by, skew join), so
- users would not be able to override the partitioner in this case.  
  
- '''Category:'''  New Functionality
+ {{{
+     A = load 'myfile';
+     B = group A all;
+     C = foreach B generate COUNT(A); -- notice that this produces a relation with one row and one column
+     D = load 'myotherfile';
+     E = group D by $0;
+     F = foreach E generate group, SUM(D.$1) / C;
+ }}}
  
- '''Dependency:'''
+ Pig Latin does not currently allow this since C is a relation (or a bag, if 
you prefer).  But it is guaranteed to be a relation with one row and one 
column.  So it
+ should be possible to do something like:
  
- '''References:''' [[https://issues.apache.org/jira/browse/PIG-282|Pig-282]]
+ {{{
+     A = load 'myfile';
+     B = group A all;
+     C = foreach B generate COUNT(A); -- notice that this produces a relation with one row and one column
+     D = load 'myotherfile';
+     E = group D by $0;
+     F = foreach E generate group, SUM(D.$1) / (long)C;
+ }}}
  
+ The planner would have to catch this and insert a store between C and D, and 
then C could be reloaded in F.  The parser should also make some effort at
+ guaranteeing that C will produce a single value, though this will not be 
bulletproof (e.g. checking that the foreach only generates one column is easy, 
checking
+ that it only produces one row is harder).  If C is reloaded and contains more 
than one row or column then a runtime error would occur.
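
The runtime check described above can be sketched outside of Pig. The following is a minimal illustration in plain Python, not Pig code: relations are modeled as lists of tuples, and `as_scalar` is a hypothetical helper name that does not exist in Pig. It only shows the intended behavior of the proposed cast: accept a relation with exactly one row and one column, and raise an error otherwise.

```python
def as_scalar(relation):
    """Return the only value of a one-row, one-column relation.

    Hypothetical helper: mimics the runtime check the proposal describes
    for a cast such as (long)C. Raises ValueError for any other shape.
    """
    rows = list(relation)
    if len(rows) != 1:
        raise ValueError(f"expected exactly one row, got {len(rows)}")
    row = rows[0]
    if len(row) != 1:
        raise ValueError(f"expected exactly one column, got {len(row)}")
    return row[0]


# A = load 'myfile'; B = group A all; C = foreach B generate COUNT(A);
a = [("x", 1), ("y", 2), ("z", 3)]
c = [(len(a),)]  # a relation with one row and one column

# D = load 'myotherfile'; E = group D by $0;
d = [("k1", 6), ("k1", 3), ("k2", 12)]
groups = {}
for key, value in d:
    groups.setdefault(key, []).append(value)

# F = foreach E generate group, SUM(D.$1) / (long)C;
n = as_scalar(c)
f = [(key, sum(values) / n) for key, values in groups.items()]
```

The check runs once, when the stored relation is read back, which corresponds to the point where the proposal says a runtime error would occur if C held more than one row or column.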
+ 
+ '''Category:'''  New Functionality
+ 
+ '''Dependency:'''
+ 
+ '''References:'''
+ 
- '''Estimated Development Effort:'''  small
+ '''Estimated Development Effort:'''  Small
+ 
+ ==== Add List Datatype ====
+ Pig has tuples (roughly equivalent to structs or records in many languages).  
Bags, which are roughly equivalent to lists, have the restriction that they can 
only
+ contain tuples.  This means that users have modeled lists as bags of tuples 
of a single element.  This is confusing to users and wastes memory.  Changing 
bags to
+ take any type would be very disruptive, since much existing Pig code is built 
around the assumption that bags only contain tuples.  Additionally bags contain
+ extensive functionality to handle memory management, spilling, etc.  A list 
type need not offer all these features.  Therefore the best route to adding this
+ functionality may be to add a list type to Pig Latin.
+ 
+ '''Category:'''  New Feature
+ 
+ '''Dependency:'''
+ 
+ '''References:'''
+ 
+ '''Estimated Development Effort:'''  Medium
  
  ==== Automated Hadoop Tuning ====
  Hadoop has many configuration parameters that can affect the latency and 
scalability of a job.  For different types of jobs, different configurations 
will yield
