Mridul Muralidharan commented on PIG-807:

I think I am missing something here.

If I am not mistaken, two (different?) use cases seem to be mentioned here:

1) Avoid materializing bags for a record when they can be streamed from the 
underlying data.
Bags currently created as (co)group output seem to fall into this category.
As in:
B = GROUP A BY id;
C = FOREACH B GENERATE SUM($1.field);

This does not require the $1.field bag to be created explicitly - instead, 
through an iterator interface, the values can be streamed straight from the 
underlying reducer output.

2) The GROUP ALL based construct seems to be about streaming an entire 
relation directly through UDFs, as a shorthand for:
A_tmp = GROUP A ALL;
B = FOREACH A_tmp GENERATE algUdf($1);

If I am right in splitting this, then :

The first use case has tremendous potential for improving performance, 
particularly for removing the annoying OOMs or spills that happen today; but 
I am not sure how it interacts with Pig's current pipeline design (if at all).
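The iterator-backed bag that case 1 calls for could look roughly like this - a minimal plain-Java sketch, with a List iterator standing in for the reduce-side values iterator Hadoop provides, and all class names hypothetical:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration: a bag whose iterator() streams values from an
// underlying iterator (standing in for Hadoop's reduce-side values iterator)
// instead of materializing them into memory.
class ReadOnceBag implements Iterable<Long> {
    private final Iterator<Long> source;
    private boolean consumed = false;

    ReadOnceBag(Iterator<Long> source) {
        this.source = source;
    }

    @Override
    public Iterator<Long> iterator() {
        if (consumed) {
            // The underlying reducer iterator can only be walked once.
            throw new IllegalStateException("read-once bag already consumed");
        }
        consumed = true;
        return source;  // stream directly; nothing is buffered or spilled
    }
}

public class StreamingSum {
    public static void main(String[] args) {
        // Stands in for the values Hadoop hands the reducer for one key.
        Iterator<Long> reducerValues = List.of(3L, 5L, 7L).iterator();
        ReadOnceBag bag = new ReadOnceBag(reducerValues);

        long sum = 0;
        for (long v : bag) {  // one forward pass, like SUM($1.field)
            sum += v;
        }
        System.out.println(sum);  // 15
    }
}
```

An algebraic UDF like SUM only ever makes one forward pass, so this never holds more than one value in memory, regardless of how many values the key has.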

Since there are (admittedly more cryptic) alternatives for doing it, I don't 
have any particular opinion about 2.


> PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
> Hadoop values iterator)
> ------------------------------------------------------------------------------------------------
>                 Key: PIG-807
>                 URL: https://issues.apache.org/jira/browse/PIG-807
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>             Fix For: 0.3.0
> Currently all bags resulting from a group or cogroup are materialized as bags 
> containing all of the contents. The issue with this is that if a particular 
> key has many corresponding values, all these values get stuffed in a bag 
> which may run out of memory and hence spill, causing a slowdown in 
> performance and sometimes memory exceptions. In many cases, the udfs which 
> use these bags coming out of a group and cogroup only need to iterate over the bag in a 
> unidirectional read-once manner. This can be implemented by having the bag 
> implement its iterator by simply iterating over the underlying hadoop 
> iterator provided in the reduce. This kind of a bag is also needed in 
> http://issues.apache.org/jira/browse/PIG-802. So the code can be reused for 
> this issue too. The other part of this issue is to have some way for the udfs 
> to communicate to Pig that any input bags that they need are "read once" bags 
> . This can be achieved by having an interface - say "UsesReadOnceBags" - 
> which serves as a tag to indicate the intent to Pig. Pig can then rewire 
> its execution plan to use ReadOnceBags if feasible.
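The tag-interface idea from the description could be sketched as below. Only the UsesReadOnceBags name comes from the issue text; the UDF class and its method are hypothetical plain-Java stand-ins, not Pig's actual EvalFunc API:

```java
// Hypothetical illustration of the tag interface from the issue description.
interface UsesReadOnceBags { }  // pure marker: no methods, just declared intent

// A toy SUM-style UDF that promises it only needs one forward pass
// over its input bag.
class StreamingSumUdf implements UsesReadOnceBags {
    long exec(Iterable<Long> bag) {
        long total = 0;
        for (long v : bag) {
            total += v;  // single pass; never revisits earlier values
        }
        return total;
    }
}

public class PlanRewiring {
    public static void main(String[] args) {
        Object udf = new StreamingSumUdf();
        // Pig's optimizer could key its plan rewiring off a check like this,
        // substituting a read-once bag for a materialized one.
        System.out.println(udf instanceof UsesReadOnceBags);  // true
    }
}
```

A marker interface keeps the contract purely declarative: existing UDFs are untouched, and Pig only changes the plan for UDFs that opt in.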

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
