Re: [Pig Wiki] Update of "ProposedProjects" by AlanGates

nitesh bhatia Thu, 16 Apr 2009 00:55:27 -0700

Hi
Can you briefly explain what is required in the first project? After reading
the description, my impression is that currently, when we execute commands
in the Pig shell, Pig first converts them to map-reduce jobs and then feeds
them to Hadoop. Is this project proposing that the execution plan made by
Pig will first be converted to a Java file for the map-reduce procedure and
then fed to the Hadoop cluster?


If this is the case, then I am sure it will be a great help to users, as this
functionality can be used to write complicated map-reduce jobs very easily.
Initially the user can write the Pig scripts / commands required for the job
and get the map-reduce Java files. The user can then edit the map-reduce
files to extend the functionality and add extra procedures that are not
provided by Pig but can be executed over Hadoop.

--nitesh

On Wed, Apr 15, 2009 at 9:57 PM, Apache Wiki <wikidi...@apache.org> wrote:

> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
> change notification.
>
> The following page has been changed by AlanGates:
> http://wiki.apache.org/pig/ProposedProjects
>
> New page:
> = Proposed Pig Projects =
> This page describes projects that we (the committers) would like to see
> added to Pig.  The scale of these projects varies, but they are larger
> projects, usually on the scale of weeks or months.  We have not yet filed
> [https://issues.apache.org/jira/browse/PIG JIRAs] for some of these
> because they are still in the vague idea stage.  As they become more
> concrete, [https://issues.apache.org/jira/browse/PIG JIRAs] will be filed
> for them.
>
> We welcome contributors to take on one of these projects.  If you would
> like to do so, please file a JIRA (if one does not already exist for the
> project) with a proposed solution.  Pig's committers will work with you
> from there to help refine your solution.  Once a solution is agreed upon,
> you can begin implementation.
>
> If you see a project here that you would like to see Pig implement but you
> are not in a position to implement the solution right now, feel free to
> vote for the project.  Add your name to the list of supporters.  This will
> help contributors looking for a project to select one that will benefit
> many users.
>
> If you would like to propose a project for Pig, feel free to add to this
> list.  If it is a smaller project, or something you plan to begin work on
> immediately, filing a [https://issues.apache.org/jira/browse/PIG JIRA] is
> a better route.
>
> || Category || Project || JIRA || Proposed By || Votes For ||
> || Execution || Pig currently executes scripts by building a pipeline of
> pre-built operators and running data through those operators in map reduce
> jobs.  We need to investigate instead having Pig generate Java code specific
> to a job, then compiling that code and using it to run the map reduce
> jobs. || || Many conference attendees || gates ||
> || Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed
> inside FOREACH.  All operators should be allowed in FOREACH.  (LIMIT is being
> worked on in [https://issues.apache.org/jira/browse/PIG-741 741].) || ||
> gates || ||
> || Optimization || Speed up comparison of tuples during shuffle for ORDER
> BY || [https://issues.apache.org/jira/browse/PIG-659 659] || olgan || ||
> || Optimization || Order by should be changed to not use POPackage to put
> all of the tuples in a bag on the reduce side, as the bag is just
> immediately flattened.  It can instead work like join does for the last
> input in the join. || || gates || ||
> || Optimization || Often in a Pig script that produces a chain of MR jobs,
> the map phases of the 2nd and subsequent jobs do very little.  What little
> they do should be pushed into the preceding reduce and the map replaced by
> the identity mapper.  Initial tests showed that the identity mapper was 50%
> faster than using a Pig mapper (because Pig uses the loader to parse out
> tuples even if the map itself is empty). ||
> [https://issues.apache.org/jira/browse/PIG-480 480] || olgan || gates ||
> || Optimization || Use hand-crafted calls to do string to integer or float
> conversions.  Initial tests showed these could be done about 8x faster than
> String.toInteger() and String.toFloat(). ||
> [https://issues.apache.org/jira/browse/PIG-482 482] || olgan || gates ||
> || Optimization || Currently Pig always samples for an ORDER BY to
> determine how to partition, and then runs another job to do the sort.  For
> small enough inputs, it should just sort with a single reducer. ||
> [https://issues.apache.org/jira/browse/PIG-483 483] || olgan || ||
> || Optimization || In many cases data to be joined is already sorted and
> partitioned on the same key.  Pig needs to be able to take advantage of this
> and do these joins in the map.  The join could be done by sampling one input
> to determine the value of the join key at the beginning of every HDFS block.
> This would form an index.  Then a second MR job can be run with the other
> input.  Based on the key seen in the second input, the appropriate blocks
> of the first input can also be loaded into the map and the join done.
> || || gates || ||
> || Optimization || The combiner is not currently used if FILTER is in the
> FOREACH.  In some cases it could still be used. ||
> [https://issues.apache.org/jira/browse/PIG-479 479] || olgan || ||
> || Optimization || Currently, when types of data are declared, Pig inserts a
> FOREACH immediately after the LOAD that does the conversions.  These
> conversions should be delayed until the field is actually used. ||
> [https://issues.apache.org/jira/browse/PIG-410 410] || olgan || gates ||
> || Optimization || When an order by is not the only operation in a Pig
> script, it is done in two additional MR jobs.  The first job samples using a
> sampling loader, the second does the sort.  The sample is used to construct
> a partitioner that equally balances the data in the sort.  The sampler needs
> to be changed to be a !EvalFunc instead of a loader.  This way a split can
> be put in the preceding MR job, with the main data being written out and
> the other part flowing to the sampler func, which can then write out the
> sample.  The final MR job can then be the sort. || || gates || ||
> || Optimization || When an order by is the only operation in a Pig script,
> it is currently done in 3 MR jobs.  The first converts the input to
> BinStorage format (because the sample loader reads that format), the second
> samples, and the third sorts.  Once the changes mentioned above to make the
> sampler an !EvalFunc are done, it should be changed to be done in 2 MR jobs
> instead of 3. || [https://issues.apache.org/jira/browse/PIG-460 460] ||
> gates || ||
> || Optimization || The Pig optimizer should be used to determine when
> fields in a record are no longer needed and put in FOREACH statements to
> project out the unnecessary data as early as possible. ||
> [https://issues.apache.org/jira/browse/PIG-466 466] || olgan || ||
> || Optimization || The Pig optimizer needs to call fieldsToRead so that
> Load functions that can do column skipping do it. || || gates || ||
> || Scalability || Pig's default join (symmetric hash) currently depends on
> being able to fit all of the values for a given join key for one of the
> inputs into memory.  (It does try to spill to disk in the case where it
> cannot fit them all into memory.  In practice this often fails, as it is not
> good at understanding when memory is low enough that it should spill.  Even
> in the case where it does not fail, spilling to disk and rereading from disk
> is very slow.)  This could be avoided if instances of keys with a large
> number of values were broken up so that the row set could fit in memory and
> then shipped to multiple reducers.  A sampling pass would need to be done
> first to determine which keys to break up. || || chris olston || gates ||
>
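
To make the "hand-crafted string to integer conversion" item in the list
above a bit more concrete, here is a rough sketch of what such a routine
might look like. This is only an illustration under my own assumptions: the
class name FastIntParser and its parseInt signature are hypothetical and are
not taken from Pig or from PIG-482, and the sketch says nothing about the
actual speedup. The idea is to parse digits straight from the raw bytes a
loader has already read, without creating a String first.

```java
/** Hypothetical sketch; not Pig code. */
public final class FastIntParser {
    private FastIntParser() {}

    /**
     * Parses a base-10 integer from ASCII bytes in buf[start, end).
     * Returns null for empty, malformed, or out-of-range input.
     */
    public static Integer parseInt(byte[] buf, int start, int end) {
        if (buf == null || start < 0 || end > buf.length || start >= end) {
            return null;
        }
        int i = start;
        boolean negative = buf[i] == '-';
        if (negative || buf[i] == '+') {
            i++;
            if (i == end) {
                return null; // a sign with no digits
            }
        }
        // Accumulate as a negative value so Integer.MIN_VALUE is representable.
        long acc = 0;
        for (; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                return null; // not a decimal digit
            }
            acc = acc * 10 - digit;
            if (acc < Integer.MIN_VALUE) {
                return null; // overflow
            }
        }
        if (!negative) {
            if (acc == Integer.MIN_VALUE) {
                return null; // +2147483648 does not fit in an int
            }
            acc = -acc;
        }
        return (int) acc;
    }
}
```

A caller would pass the byte range of one field, for example
FastIntParser.parseInt(bytes, 0, bytes.length), and treat a null result as
a field that could not be converted.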



-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
