The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal

--------------------------------------------------
= Pig Journal =
This document is a successor to the ProposedRoadMap.  Rather than simply proposing the work going forward for Pig, it also summarizes work done in the past (back to when Pig moved from a research project at Yahoo Labs to being part of the Yahoo grid team, which was approximately when Pig was first released to open source), current work, and proposed future work.  Note that proposed future work is exactly that, __proposed__.  There is no guarantee that it will be done, and the project is still open to input on whether and when such work should be done.

== Completed Work ==
The following table lists the features that have been completed as of Pig 0.6.

|| Feature || Available in Release || Comments ||
|| Describe Schema || 0.1 || ||
|| Explain Plan || 0.1 || ||
|| Add log4j to Pig Latin || 0.1 || ||
|| Parameterized Queries || 0.1 || ||
|| Streaming || 0.1 || ||
|| Documentation || 0.2 || Docs are never really done of course, but Pig now has a setup document, tutorial, Pig Latin users and reference guides, a cookbook, a UDF writers guide, and API javadocs. ||
|| Early error detection and failure || 0.2 || When this was originally added to the !ProposedRoadMap it referred to being able to do type checking and other basic semantic checks. ||
|| Remove automatic string encoding || 0.2 || ||
|| Add ORDER BY DESC || 0.2 || ||
|| Add LIMIT || 0.2 || ||
|| Add support for NULL values || 0.2 || ||
|| Types beyond String || 0.2 || ||
|| Multiquery support || 0.3 || ||
|| Add skewed join || 0.4 || ||
|| Add merge join || 0.4 || ||
|| Support Hadoop 0.20 || 0.5 || ||
|| Improved Sampling || 0.6 || There is still room for improvement in order by sampling ||
|| Change bags to spill after reaching fixed size || 0.6 || Also created bag backed by Hadoop iterator for single UDF cases ||
|| Add Accumulator interface for UDFs || 0.6 || ||
|| Switch local mode to Hadoop local mode || 0.6 || ||
|| Outer join for default, fragment-replicate, skewed || 0.6 || ||
|| Make configuration available to UDFs || 0.6 || ||

== Work in Progress ==
This covers work that is currently being done.  For each entry the main JIRA 
for the work is referenced.

|| Feature || JIRA || Comments ||
|| Metadata || [[http://issues.apache.org/jira/browse/PIG-823|PIG-823]] || ||
|| Query Optimizer || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || ||
|| Load Store Redesign || [[http://issues.apache.org/jira/browse/PIG-966|PIG-966]] || ||
|| Add SQL Support || [[http://issues.apache.org/jira/browse/PIG-824|PIG-824]] || ||
|| Change Pig internal representation of chararray to Text || [[http://issues.apache.org/jira/browse/PIG-1017|PIG-1017]] || Patch ready; unclear when to commit to minimize disruption to users and destabilization of the code base. ||
|| Integration with Zebra || [[http://issues.apache.org/jira/browse/PIG-833|PIG-833]] || ||


== Proposed Future Work ==
Work that the Pig project proposes to do in the future is broken into three categories:
 1. Work that we agree needs to be done, where the approach is also generally agreed upon, but which we have not yet gotten to
 2. Work that we agree needs to be done, but where the approach is not yet clear or there is no general agreement as to which approach is best
 3. Experimental work, including features that may not yet be agreed upon or whose benefit is not yet known

For each of these proposed features, a brief description is given plus the 
following information:
 * Category - what type of feature is this; categories include:
   * Performance
   * Usability
   * Integration with other Hadoop Subprojects
   * New Functionality
   * Development - that is, proposed features that will be of interest to developers but may not make noticeable changes for users
 * Dependencies - any other feature or proposed feature that this depends on
 * References - any relevant JIRAs, wiki pages, white papers, etc.
 * Estimated Development Effort - a __guess__ at how long this will take, with 
the following three categories:
   * small - less than 1 person month
   * medium - 1-3 person months
   * large - more than 3 person months

Within each subsection order is alphabetical and does not imply priority.

=== Agreed Work, Agreed Approach ===
==== Boolean Type ====
Boolean is currently supported internally as a type in Pig, but it is not 
exposed to users.  Data cannot be of type boolean, nor can UDFs (other than
!FilterFuncs) return boolean.  Users have repeatedly requested that boolean be 
made a full type.
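
As a sketch of what full boolean support might look like (the syntax and the IsWellFormed UDF below are hypothetical; none of this works today):

{{{
    -- hypothetical: boolean as a first-class field type
    A = load 'users' as (name: chararray, active: boolean);
    -- hypothetical: an EvalFunc (not just a FilterFunc) returning boolean
    B = foreach A generate name, IsWellFormed(name) as valid;
}}}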

'''Category:'''  New Functionality

'''Dependency:'''  Will affect all !LoadCasters, as they will have to provide 
byteToBoolean methods.

'''References:'''

'''Estimated Development Effort:'''  small

==== Clean Up File Access Packages ====
Early on, Pig sought to be completely Hadoop independent in its front end processing (parsing, logical plan, optimizer).  To this end a number of abstractions were created for file access, located in the org.apache.pig.backend.datastorage package.  Now that the goal has been revised to keeping Pig Latin Hadoop independent while allowing the current implementation to use Hadoop where convenient, there is no longer a need for this abstraction, which makes access of HDFS files and directories difficult to understand.  It should be cleaned up.

'''Category:'''  Development

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  small

==== Clean Up Memory Management ====
As of Pig 0.6 memory management of bags has moved from the !SpillableMemoryManager to the bags themselves.  !SpillableMemoryManager and its associated classes need to be removed.

'''Category:'''  Development

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  small

==== Developer Documentation ====
Pig needs comprehensive design documentation to assist developers working in different areas of the code, as well as good javadocs.  Currently there is no comprehensive Pig functional specification or design documentation, and the javadocs that exist are incomplete and inconsistent.

'''Category:'''  Development

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  small

==== Date and Time Types ====
Date and time types need to be added to Pig, to allow users to store and analyze time-based data without having to handle translation and write all of their own manipulation routines.  We hope to find an implementation of time types in an existing open source project (perhaps [[http://db.apache.org/derby/|Apache Derby]] or a similar project) that could be integrated with Pig, rather than implementing the representation and operators from scratch.
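
As a sketch of what this might look like to users (hypothetical; neither the datetime type nor the ToDate UDF exists today):

{{{
    -- hypothetical: a date/time type declared in a schema
    A = load 'events' as (user: chararray, ts: datetime);
    -- hypothetical: comparison without user-written conversion routines
    B = filter A by ts > ToDate('2009-01-01');
}}}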

'''Category:'''  New Functionality

'''Dependency:'''  Will affect all !LoadCasters, as they will have to provide 
byteToDate methods.

'''References:'''

'''Estimated Development Effort:'''  medium

==== Error Handling ====
Pig's error handling is not good.  There are two parts to this problem.  First, users frequently complain that error messages give no useful information as to what the problem is or how to fix it.  Work needs to be done to ensure that error messages are meaningful to users and that an error resolution guide exists to help users understand what an error message means and what actions they should take to remedy the situation.  Second, Map Reduce does not reliably return errors that occur during Map Reduce execution.  Since the error is returned to Pig as one long Java String (rather than as an Exception object), Pig is left to attempt to decipher which portion of the error message is meaningful to the user and which is not, and it is not always successful.  Map Reduce needs to return the error to Pig in an object format so Pig can more easily determine the relevant part of the error.

'''Category:'''  Usability

'''Dependency:'''  Exceptions from Map Reduce as Exception, not String; 
Standardize on Parser and Scanner Technology because many of the bad error 
messages come from
the parser

'''References:'''

'''Estimated Development Effort:'''  medium

==== Fixed Point Type ====
Pig currently supports the floating point types float and double.  These are not adequate for data where loss of precision is unacceptable, such as financial data.  To address this, Pig needs to add a fixed point type, similar to SQL's decimal type.  We hope to find an implementation of a fixed point type in an existing open source project (perhaps [[http://db.apache.org/derby/|Apache Derby]] or a similar project) that could be integrated with Pig, rather than implementing the representation and operators from scratch.

'''Category:'''  New Functionality

'''Dependency:'''  Will affect all !LoadCasters, as they will have to provide 
byteToFixed methods.

'''References:'''

'''Estimated Development Effort:'''  medium

==== Map Reduce Optimizer ====
Currently the optimizations in the Map Reduce layer (such as using the combiner, stitching multi-store queries together into one MR job, etc.) are a hodge-podge of visitors.  These need to be reworked to use an optimizer framework like the logical optimizer's.  The hope is that once the logical optimizer is reworked and stabilized, the same framework can be used to rework the Map Reduce optimizer.

'''Category:'''  Development

'''Dependency:'''  Logical optimizer rework (see 
[[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]])

'''References:'''

'''Estimated Development Effort:'''  large


==== Nesting, Full Support of Pig Latin Inside Foreach ====
Currently only FILTER, ORDER, DISTINCT, and LIMIT are supported inside FOREACH.  To fully support arbitrary levels of nesting in data we need to support the rest of Pig Latin.
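
For contrast, the nesting that is supported today looks like the following (relation and field names are illustrative):

{{{
    -- supported today: ORDER and LIMIT inside a nested FOREACH block
    B = group A by user;
    C = foreach B {
            recent = order A by ts desc;
            top5 = limit recent 5;
            generate group, top5;
    };
}}}

Extending this to the rest of Pig Latin would allow, for example, a nested GROUP or JOIN over the bag A within the same block.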

'''Category:'''  New Functionality

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  large

==== Outer Join for Merge Join ====
Merge join is the last join type that does not support outer join.  Right outer join is doable in the current infrastructure.  Left and full outer join will require an index (either in the data or built by a preliminary MR job, just as the index required for the right side is built now).

'''Category:'''  New Functionality

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  small

==== Pig Mix 2.0 ====
Pig Mix has been a very useful tool for testing Pig's performance from version to version and for communicating the results of those tests to users.  However, it was developed prior to release 0.3 and does not test any functionality included in 0.4 or later.  The current Pig Mix also tests only latency, not scalability.  A new version of Pig Mix is needed that tests additional Pig functionality, such as outer joins, new join implementations, and use of the accumulator interface.  Scalability tests also need to be added to Pig Mix 2.0, or a separate scalability benchmark developed, so that Pig developers can measure Pig's scalability as changes are made.

'''Category:'''  Development

'''Dependency:'''

'''References:''' [[http://wiki.apache.org/pig/PigMix|Pig Mix]]

'''Estimated Development Effort:'''  medium

==== Pig Server ====
Currently Pig runs as a "fat client" where all of the front end processing is done on the user's machine.  This has the advantage that it requires no installation and no maintenance of a server.  However, it has the drawbacks that upgrades require upgrading every client machine, that users may be using different versions of Pig without intending to as they move from machine to machine, and that services such as logging and security cannot be centralized.

'''Category:'''  Usability

'''Dependency:'''  As a Pig server would most likely be multi-threaded, this 
project would require cleaning up Pig's code to be thread safe.

'''References:'''  [[http://issues.apache.org/jira/browse/PIG-603|PIG-603]]

'''Estimated Development Effort:'''  large

==== Specifying the Value Type for Maps ====
Currently maps require that their key be of type String, while allowing their values to be of any type.  In practice, Pig assigns the type bytearray to the value.  If the value is actually another type (that is, the loader or UDF that created the map created it as another type and not a !DataByteArray), the script writer is still required to cast the value to the type it already is so that Pig understands how to handle the data.  Given that users often store only one type of data in a map, it would be convenient for them to be able to specify a type for the value as well.  The contract would then be that all values in that map must be of the specified type.  By default maps would still leave the value type unspecified.
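
To illustrate, today the script writer must cast even when the loader stored an int; with a declared value type the cast would be unnecessary (the map[int] syntax below is hypothetical):

{{{
    -- today: the value is treated as bytearray, so a cast is required
    A = load 'profiles' as (user: chararray, prefs: map[]);
    B = foreach A generate user, (int) prefs#'visits';

    -- hypothetical: declaring that all values in the map are ints
    A2 = load 'profiles' as (user: chararray, prefs: map[int]);
    B2 = foreach A2 generate user, prefs#'visits';
}}}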

'''Category:'''  New Functionality

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  small

==== Statistics for Optimizer ====
Currently Pig's optimizer is entirely rule based.  We would like to allow cost based optimization.  Some of this can be done with existing statistics (file size, etc.), but most will require more statistics.  Pig needs a mechanism to generate, store, and retrieve those statistics.  Most likely storage and retrieval would be done via Owl or other metadata services.  Some initial work on how to represent these statistics has been done in the Load-Store redesign (see [[http://issues.apache.org/jira/browse/PIG-966|PIG-966]]) and as part of [[http://issues.apache.org/jira/browse/PIG-760|PIG-760]].  Collection could be done by Pig as it runs queries over data, by ETL tools as they generate the data, or by crawlers.

'''Category:'''  New Functionality

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  medium

==== UDF Support in Other Languages ====
Currently Pig users must implement UDFs in Java.  We would like to extend this 
to allow !EvalFuncs and !FilterFuncs to be implemented in scripting languages.
There seems to be consensus that implementing this in one of the frameworks 
that compiles scripting languages down to Java bytecode would be simpler than
supporting any number of languages and also would provide sufficient scripting 
support.  Specifically, Python, Ruby, and Groovy can all be supported in this
manner, though Perl and C cannot.  Which framework to use for this is not clear.
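
One possible user-facing shape, assuming a Jython-style bytecode bridge were chosen (entirely hypothetical; neither the register clause nor the bridge exists yet):

{{{
    -- hypothetical: registering a file of Python EvalFuncs through a
    -- JVM scripting framework
    register 'myfuncs.py' using jython as myfuncs;
    B = foreach A generate myfuncs.normalize(url);
}}}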

'''Category:'''  New Functionality

'''Dependency:'''

'''References:'''  [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]

'''Estimated Development Effort:'''  medium

=== Agreed Work, Unknown Approach ===

==== Clarify Pig Latin Semantics ====
There are areas of Pig Latin semantics that are not clear or not consistent.  
Take for example, a script like:

{{{
    A = load 'foo' AS (a: bag, b: int);
    B = foreach A generate flatten(a);
}}}

What is the schema of B? It should be unknown, since the schema of a is 
unknown.  Currently it is instead assigned a schema of (bytearray).

Solving this involves two steps.  First, a definitive, clear, consistent grammar needs to be developed for Pig Latin.  Second, the front end code (mostly the !LogicalPlan and the type checker) needs to be modified to ensure that it conforms to this specification.

'''Category:'''  Usability

'''Dependency:'''  Should be done after a parser technology is selected as 
standard (see Standardize on Parser and Scanner Technology) since it will 
require changes
to the grammar.

'''References:'''

'''Estimated Development Effort:'''  medium

==== Extending Pig to Include Branching, Looping, and Functions ====
It would be very convenient for Pig Latin to include branching, looping, and 
function calls.  Consider for example a program where the user wishes to 
iterate over
data until it begins to converge:

{{{
    A = load 'webcrawl' as (url: chararray, links: bag);
    while (unresolved_links(links) > 0.9 * COUNT(links)) {
        -- resolve links
        ...
    }
    store Z into 'webmap';
}}}

There are at least two ways this could be accomplished.  One, Pig Latin itself could be extended to include these features.  Two, Pig Latin could be embedded in an existing scripting language (such as Python, Ruby, Perl, maybe others), with that language's branching, looping, and function constructs providing Pig's control flow.  There are advantages and disadvantages to each.  Hybrid approaches (e.g. branching and looping in a scripting language, functions or macros in Pig Latin) are also possible.  The Pig team needs to come to a consensus on which path to choose.

'''Category:'''  New functionality

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  large

==== IDE for Pig ====
!PigPen was developed and released with Pig 0.2.  However, it has not been kept up to date.  Users have consistently expressed interest in an IDE for Pig.  Ideally this would also include tools for writing UDFs, not just Pig Latin scripts.  One option is to bring !PigPen up to date and maintain it.  Another option is to build a browser based IDE; some have suggested that this would be better than an Eclipse based one.

'''Category:'''  New Functionality

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  large and ongoing

==== SQL Expansion ====
The original SQL implementation supports only the most basic SQL:
 * INSERT INTO
 * SELECT FROM WHERE 
 * JOIN 
 * GROUP BY HAVING
 * ORDER BY
 * no subqueries

Where do we want SQL support to go?  Should we strive to implement full ANSI 
compliance?  Should we integrate with reporting tools such as Microstrategy?  Or
should we instead focus on SQL for ETL and data pipelines?

'''Category:'''  New Functionality

'''Dependency:'''

'''References:''' [[http://issues.apache.org/jira/browse/PIG-824|PIG-824]]   

'''Estimated Development Effort:'''  depends on how much SQL we decide to 
implement

==== Standardize on Parser and Scanner Technology ====
Currently Pig Latin and grunt use Javacc for parsing and scanning.  The SQL implementation uses Jflex for scanning and Cup for parsing.  Javacc has proven difficult to work with, is very poorly documented, and gives users horrible, barely understandable error messages.  Pig needs to select parsing and scanning packages and use them throughout.  Antlr, Sablecc, and perhaps other technologies need to be investigated as well.

'''Category:'''  Development and Usability (for better error messages)

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  medium


==== Statistics on Usage ====
It would be very useful for Pig developers if Pig collected statistics on how users use Pig.  This could include what scripts were run, basic characteristics of the data, etc.  Note that this is separate from collecting statistics about data for the optimizer, though the two may share some functionality.  Also, this raises security concerns (who gets to see who ran what) and thus will have to be configurable from site to site.  This has been placed in the unknown approach section because no design of how to collect the statistics, where to store them, etc. has been proposed.

'''Category:'''  New Functionality

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  medium

=== Experimental ===
==== Automated Hadoop Tuning ====
Hadoop has many configuration parameters that can affect the latency and scalability of a job.  For different types of jobs, different configurations will yield optimal results.  For example, a job with no memory intensive operations in the map phase but with a combine phase will want to set Hadoop's io.sort.mb quite high, to minimize the number of spills from the map.  But a job with a memory intensive operation in the map and no combine phase will want to set io.sort.mb low, to allocate more memory to the memory intensive operator and less to the map-side sort.  Adding this feature would greatly increase the utility of Pig for Hadoop users, as it would free them from needing to understand Hadoop well enough to tune it for their particular jobs.

'''Category:'''  Usability

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  large

==== Generated Execution Code ====
Currently Pig has a set of Physical Operators that contain the logic to execute Pig programs.  To execute a given program, a pipeline of these physical operators is constructed, split into Map Reduce jobs, and shipped to Hadoop.  We need to investigate changing the physical operators to instead generate Java code; Pig could then generate the code, compile it, and pass it to Hadoop.  Some sources we have read suggest that a significant performance improvement could be gained.  This would also allow Pig to use pre-compiled tuples specific to a given script, which should improve memory usage and performance.  On the other hand, it would make the code more complex to develop and maintain, and it would make Pig more complex to install, as it would require a Java compiler as part of the Pig deployment.

'''Category:'''  New Functionality

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  large

==== Integration with Avro ====
Pig needs to investigate using Avro for transferring data between MR jobs, in 
lieu of Pig's current !BinStorage.  It has also been suggested that we use
Avro for serializing non-data objects (such as pipelines, function specs, 
etc.).  The costs and benefits of this need to be investigated as well.

'''Category:'''  Integration

'''Dependency:'''

'''References:''' [[http://issues.apache.org/jira/browse/PIG-794|PIG-794]] 
contains a prototype for replacing !BinStorage with Avro.  At the time this was 
done the Avro
implementation was no faster than !BinStorage.  

'''Estimated Development Effort:'''  small

==== Integration with Oozie ====
It has been suggested that Pig should be able to generate Oozie jobs in 
addition to (or perhaps instead of) directly generating Map Reduce jobs.  It 
has also been
suggested that Pig Latin should include commands to control Oozie, thus 
allowing Pig Latin to be a language for workflows on Hadoop.   The Pig team 
needs to consider
these options and decide how Pig and Oozie should be integrated. 

'''Category:'''  Integration

'''Dependency:'''

'''References:'''

'''Estimated Development Effort:'''  depends on what type of integration is 
chosen

==== Run Map Reduce Jobs Directly From Pig ====
It would be very useful to be able to run arbitrary Map Reduce jobs from inside 
Pig.  This would look something like:

{{{
    A = load 'myfile' as (user, url);
    B = filter A by notABot(user);
    C = native B {jar=mr.jar ...} as (user, url, estimated_value);
    D = group C by user;
    E = foreach D generate user, SUM(C.estimated_value);
    store E into 'output';
}}}

The semantics would be that before the native command, Pig would write its output to an HDFS file.  That file would then be the input to the native program.  The native program would in turn write its output to HDFS, which would become the input for the next Pig operation.

This allows users to integrate legacy MR functionality, as well as functionality that is better written in MR, with their Pig scripts.

'''Category:'''  Integration

'''Dependency:'''

'''References:'''  [[http://issues.apache.org/jira/browse/PIG-506|PIG-506]]

'''Estimated Development Effort:'''  medium
