Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=7&rev2=8

--------------------------------------------------

  project is still open to input on whether and when such work should be done.
  
  == Completed Work ==
- The following table contains a list of features that have been completed, as 
of Pig 0.6
+ The following table contains a list of features that have been completed, as 
of Pig 0.7
  
  || Feature                                              || Available in 
Release || Comments ||
  || Describe Schema                                      || 0.1                
  || ||
@@ -34, +34 @@

  || Outer join for default, fragment-replicate, skewed   || 0.6                
  || ||
  || Make configuration available to UDFs                 || 0.6                
  || ||
  || Load Store Redesign                                  || 0.7                
  || ||
- || Add Owl as contrib project                           || not yet released   
  || ||
  || Pig Mix 2.0                                          || not yet released   
  || ||
  
  == Work in Progress ==
  This covers work that is currently being done.  For each entry the main JIRA 
for the work is referenced.
  
- || Feature                                  || JIRA                           
                              || Comments ||
+ || Feature                                     || JIRA                        
                                || Comments ||
- || Boolean Type                             || 
[[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
+ || Boolean Type                                || 
[[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
- || Query Optimizer                          || 
[[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]]   || ||
+ || Query Optimizer                             || 
[[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]]  || ||
- || Cleanup of javadocs                      || 
[[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || ||
+ || Cleanup of javadocs                         || 
[[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || ||
- || UDFs in scripting languages              || 
[[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]   || ||
+ || UDFs in scripting languages                 || 
[[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]   || ||
- || Ability to specify a custom partitioner  || 
[[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]   || ||
+ || Ability to specify a custom partitioner     || 
[[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]   || ||
- || Pig usage stats collection               || 
[[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], 
[[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], 
[[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], 
[[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || ||
+ || Pig usage stats collection                  || 
[[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], 
[[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], 
[[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], 
[[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || ||
- || Make Pig available via Maven             || 
[[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || ||
+ || Make Pig available via Maven                || 
[[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || ||
- 
+ || Standard UDFs Pig Should Provide            || 
[[https://issues.apache.org/jira/browse/PIG-1405|PIG-1405]] || ||
+ || Add Scalars To Pig Latin                    || 
[[https://issues.apache.org/jira/browse/PIG-1434|PIG-1434]] || ||
+ || Run Map Reduce Jobs Directly From Pig       || 
[[https://issues.apache.org/jira/browse/PIG-506|PIG-506]]   || ||
  
  == Proposed Future Work ==
  Work that the Pig project proposes to do in the future is further broken into 
three categories:
@@ -73, +74 @@

  Within each subsection order is alphabetical and does not imply priority.
  
  === Agreed Work, Agreed Approach ===
+ ==== Make Illustrate Work ====
+ Illustrate has become Pig's neglected stepchild.  Users find it very useful, but developers have not kept it up to date with new features (e.g. it does not work with merge join).  Also, as currently
+ implemented it has code in many of Pig's physical operators.  This means the code is more complex and burdened with branches, making it harder to maintain.  It also means that when doing new development it is
+ easy to forget about illustrate.  Illustrate needs to be redesigned so that it does not add complexity to physical operators and so that adding illustrate support to new operators is both necessary and easy.
+ Tests for illustrate also need to be added to the test suite so that it is not broken unintentionally.
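+ 
+ As a reminder of what the feature does, illustrate is invoked on an alias from grunt or a script (a small sketch; the file and field names here are illustrative):
+ 
+ {{{
+     A = load 'visits' as (user:chararray, url:chararray);
+     B = filter A by url is not null;
+     C = group B by user;
+     illustrate C;  -- display a small example dataset flowing through each operator in the plan
+ }}}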
+ 
+ '''Category:'''  Usability
+ 
+ '''Dependency:''' 
+ 
+ '''References:''' 
+ 
+ '''Estimated Development Effort:'''  medium
+ 
  ==== Combiner Not Used with Limit or Filter ====
  Pig scripts that have a foreach with a nested limit or filter do not use the combiner even when they could.  Not all filters can use the combiner, but in some cases
  they can.  I think all limits could at least apply the limit in the combiner, though the UDF itself may only be executed in the reducer.
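  
  For example, a script like the following (relation and field names are hypothetical) has a nested limit that could be partially applied in the combiner, cutting the data shipped to the reducers:
  
  {{{
      A = load 'clicks' as (user, url);
      B = group A by user;
      C = foreach B {
          top = limit A 10;
          generate group, top;
      };
  }}}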
@@ -91, +106 @@

  that goal to be to keep Pig Latin Hadoop independent but allow the current 
implementation to use Hadoop where it is convenient, there is no
  longer a need for this abstraction.  This abstraction makes access to HDFS files and directories difficult to understand.  It should be
  cleaned up.
- 
- '''Category:'''  Development
- 
- '''Dependency:'''
- 
- '''References:'''
- 
- '''Estimated Development Effort:'''  small
- 
- ==== Clean Up Memory Management ====
- As of Pig 0.6 memory management of bags has moved from the 
!SpillableMemoryManager to the bags themselves.  !SpillableMemoryManger and its
- associated classes need to be removed.
  
  '''Category:'''  Development
  
@@ -251, +254 @@

  Currently Pig's optimizer is entirely rule based.  We would like to allow cost-based optimization.  Some of this can be done with existing
  statistics (file size, etc.), but most will require more statistics.  Pig needs a mechanism to generate, store, and retrieve those statistics.  Most likely
  storage and retrieval
- would be done via Owl or other metadata services.  Some initial work on how 
to represent these statistics have been done in the Load-Store redesign (see
+ would be done via Howl or other metadata services.  Some initial work on how to represent these statistics has been done in the Load-Store redesign (see
  [[http://issues.apache.org/jira/browse/PIG-966|PIG-966]]) and as a part of 
[[http://issues.apache.org/jira/browse/PIG-760|PIG-760]].  Collection could be 
done by Pig as it
- runs queries over data, by ETL tools as they generate the data, or by 
crawlers.
+ runs queries over data, by data loading tools, or by crawlers.
  
  '''Category:'''  New Functionality
  
@@ -308, +311 @@

  
  '''Dependency:'''
  
- '''References:'''
+ '''References:''' TuringCompletePig
  
  '''Estimated Development Effort:'''  large
  
@@ -325, +328 @@

  
  '''Estimated Development Effort:'''  large and ongoing
  
- ==== SQL Expansion ====
- The original implementation of SQL implements only the most basic SQL:
-  * INSERT INTO
-  * SELECT FROM WHERE 
-  * JOIN 
-  * GROUP BY HAVING
-  * ORDER BY
-  * no subqueries
- 
- Where do we want SQL support to go?  Should we strive to implement full ANSI 
compliance?  Should we integrate with reporting tools such as Microstrategy?  Or
- should we instead focus on SQL for ETL and data pipelines?
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''
- 
- '''References:''' [[http://issues.apache.org/jira/browse/PIG-824|PIG-824]]   
- 
- '''Estimated Development Effort:'''  depends on how much SQL we decide to 
implement
- 
- ==== Standard UDFs Pig Should Provide ====
- There are a number of UDFs in Piggybank that might be considered standard, 
such as UPPER, LOWER, etc.  These could be moved into Pig proper so that they 
are better tested
- and maintained.  Also the Pig team should consider what additional UDFs 
should be added as standard.  Categories for consideration
- include string functions, math functions, statistics functions, date and time 
functions.  We should also consider if there are
- functions unique to Pig that we should add.  For example, it has been 
suggested that Pig should have a !MakeTuple function that would
- take existing fields and create a tuple.
- 
- '''Category:'''  Usability
- 
- '''Dependency:'''  Date and time functions will depend on having a date and 
time type.
- 
- '''References:'''
- 
- '''Estimated Development Effort:'''  small to medium, depending on how many 
UDFs are moved and written
- 
- ==== Standardize on Parser and Scanner Technology ====
+ ==== Better Parser and Scanner Technology ====
- Currently Pig Latin and grunt use Javacc for parsing and scanning.  The SQL 
implementation uses Jflex for scanning and Cup for parsing.  Javacc has proven 
to be
+ Currently Pig Latin and grunt use Javacc for parsing and scanning.  Javacc 
has proven to be
- difficult to work with, very poorly documented, and gives users horrible, 
barely understandable error messages.  Pig needs to select parsing and scanning
+ difficult to work with, very poorly documented, and gives users horrible, 
barely understandable error messages.  Pig needs to select better parsing and 
scanning
- packages and use them through out.  Antlr, Sablecc, and perhaps other 
technologies need to be investigated as well.
+ packages.  Antlr, Sablecc, and perhaps other technologies need to be investigated.
  
  '''Category:'''  Developer and Usability (for better error messages)
  
@@ -375, +343 @@

  
  
  === Experimental ===
- ==== Add Scalars To Pig Latin ====
- Users have repeatedly requested the ability to something like this:
- 
- {{{
-     A = load 'myfile';
-     B = group A all;
-     C = foreach B generate COUNT(A); -- notice that this produces a relation 
with one row and one column
-     D = load 'myotherfile';
-     E = group D by $0;
-     F = foreach E generate group, sum(D.$1) / C;
- }}}
- 
- Pig Latin does not currently allow this since C is a relation (or a bag, if 
you prefer).  But it is guaranteed to be a relation with one row and one 
column.  So it
- should be possible to do something like:
- 
- {{{
-     A = load 'myfile';
-     B = group A all;
-     C = foreach B generate COUNT(A); -- notice that this produces a relation 
with one row and one column
-     D = load 'myotherfile';
-     E = group D by $0;
-     F = foreach E generate group, sum(D.$1) / (long)C;
- }}}
- 
- The planner would have to catch this and insert a store between C and D, and 
then in F C could be reloaded.  The parser should also make some effort at
- guaranteeing that C will produce a single value, though this will not be 
bullet proof (e.g. checking that the foreach only generates one column is easy, 
checking
- that it only produces one row is harder).  If C is reloaded and contains more 
than one row or column then a runtime error would occur.
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''
- 
- '''References:'''  [[https://issues.apache.org/jira/browse/PIG-1434|PIG-1434]]
- 
- '''Estimated Development Effort:'''  Small
- 
  ==== Add List Datatype ====
  Pig has tuples (roughly equivalent to structs or records in many languages).  
Bags, which are roughly equivalent to lists, have the restriction that they can 
only
  contain tuples.  This means that users have modeled lists as bags of tuples 
of a single element.  This is confusing to users and wastes memory.  Changing 
bags to
@@ -497, +429 @@

  
  '''Estimated Development Effort:'''  medium (involves rewrite of many 
physical operators)
  
- 
- 
- ==== Run Map Reduce Jobs Directly From Pig ====
- It would be very useful to be able to run arbitrary Map Reduce jobs from 
inside Pig.  This would look something like:
- 
- {{{
-     A = load 'myfile' as (user, url);
-     B = filter A by notABot(user);
-     C = native B {jar=mr.jar ...} as (user, url, estimated_value);
-     D = group C by user;
-     E = foreach D generate user, SUM(C.estimated_value);
-     store E into 'output';
- }}}
- 
- The semantics would be that before the native command, Pig would write output 
to an HDFS file.  That file would then be input to the native program.  The 
native
- program would also dump output to HDFS, which would become the input for the 
next Pig operation.
- 
- This allows users to integrate legacy MR functionality as well as 
functionality that is better written in MR with their pig scripts. 
- 
- '''Category:'''  Integration
- 
- '''Dependency:'''
- 
- '''References:'''  [[http://issues.apache.org/jira/browse/PIG-506|PIG-506]]
- 
- '''Estimated Development Effort:'''  medium
- 
