David Ciemiewicz commented on PIG-807:

I wonder if there is also a need for an additional class of functions 
to go along with ReadOnce / Streaming applications:
Accumulating Functions that operate on ordered data and output a tuple for 
each and every tuple read.

For instance, cumulative sums, rank, dense rank, and cumulative proportions 
could all be written as Accumulating Functions that operate on streams.

From my Perl example above, cumulative sum would be a function that does:

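sub accumulate {
        my $self = shift;     # object holding the running state
        my $value = shift;    # next value read from the ordered stream

        $self->{'sum'} += $value;

        return $self->{'sum'};    # one output per input value
}
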
These kinds of functions would be different from the SUM, COUNT, MIN, MAX, ... 
Accumulating functions.
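
To make the distinction concrete, here is a minimal Java sketch of what such per-tuple functions might look like. The AccumulatingFunction interface, the CumulativeSum and DenseRank classes, and the apply helper are all hypothetical names invented for illustration; none of them are part of Pig's API. The sketch assumes the input arrives already ordered and shows only the one-output-per-input behavior described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AccumulatingDemo {
    // Hypothetical interface (not a Pig API): called once per input value,
    // returns one output value, and keeps state between calls.
    interface AccumulatingFunction<I, O> {
        O accumulate(I value);
    }

    // Running total, mirroring the Perl accumulate() sub.
    static final class CumulativeSum implements AccumulatingFunction<Double, Double> {
        private double sum = 0.0;

        @Override
        public Double accumulate(Double value) {
            sum += value;
            return sum;
        }
    }

    // Dense rank over already-sorted input: equal neighbours share a rank,
    // and the rank grows by one whenever the value changes.
    static final class DenseRank implements AccumulatingFunction<Integer, Integer> {
        private Integer previous = null;
        private int rank = 0;

        @Override
        public Integer accumulate(Integer value) {
            if (previous == null || !previous.equals(value)) {
                rank++;
                previous = value;
            }
            return rank;
        }
    }

    // Single forward pass over the stream; emits one output per input.
    static <I, O> List<O> apply(AccumulatingFunction<I, O> f, Iterable<I> ordered) {
        List<O> out = new ArrayList<>();
        for (I value : ordered) {
            out.add(f.accumulate(value));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [1.0, 3.0, 6.0]
        System.out.println(apply(new CumulativeSum(), Arrays.asList(1.0, 2.0, 3.0)));
        // prints [1, 1, 2, 3, 3]
        System.out.println(apply(new DenseRank(), Arrays.asList(10, 10, 20, 30, 30)));
    }
}
```

Because each function only needs the current value plus a small amount of carried state, it never has to revisit earlier tuples, which is why it would fit a read-once stream.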

I think that any designs / redesigns of Pig to support ReadOnce data should 
also include consideration for these kinds of cumulative sum type functions as 

> PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
> Hadoop values iterator)
> ------------------------------------------------------------------------------------------------
>                 Key: PIG-807
>                 URL: https://issues.apache.org/jira/browse/PIG-807
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>             Fix For: 0.3.0
> Currently all bags resulting from a group or cogroup are materialized as bags 
> containing all of the contents. The issue with this is that if a particular 
> key has many corresponding values, all these values get stuffed into a bag 
> which may run out of memory and hence spill, causing a slowdown in performance 
> and sometimes memory exceptions. In many cases, the UDFs which use these bags 
> coming out of a group or cogroup only need to iterate over the bag in a 
> unidirectional, read-once manner. This can be implemented by having the bag 
> implement its iterator by simply iterating over the underlying Hadoop 
> iterator provided in the reduce. This kind of bag is also needed in 
> http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for 
> this issue too. The other part of this issue is to have some way for the UDFs 
> to communicate to Pig that any input bags that they need are "read once" 
> bags. This can be achieved by having an interface, say "UsesReadOnceBags", 
> which serves as a tag to indicate the intent to Pig. Pig can then rewire its 
> execution plan to use ReadOnceBags if feasible.
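
The two pieces described in the issue, a bag that iterates straight over the reduce-side iterator and a tag interface the UDF uses to declare its intent, could be sketched roughly as follows. This is an illustration under stated assumptions, not Pig's implementation: UsesReadOnceBags takes its name from the issue text, while ReadOnceBag, StreamingSum, and canUseReadOnceBag are hypothetical, and a plain java.util.Iterator stands in for the Hadoop values iterator.

```java
import java.util.Arrays;
import java.util.Iterator;

class ReadOnceDemo {
    // Tag interface named after the suggestion in the issue; it has no methods
    // and only signals that the UDF reads each input bag once, front to back.
    interface UsesReadOnceBags {}

    // Hypothetical bag that never materializes its contents: iterator() hands
    // out the underlying iterator, and a second call is rejected.
    static final class ReadOnceBag<T> implements Iterable<T> {
        private final Iterator<T> source;
        private boolean consumed = false;

        ReadOnceBag(Iterator<T> source) {
            this.source = source;
        }

        @Override
        public Iterator<T> iterator() {
            if (consumed) {
                throw new IllegalStateException("read-once bag already iterated");
            }
            consumed = true;
            return source;
        }
    }

    // Hypothetical UDF that needs only one forward pass over its input.
    static final class StreamingSum implements UsesReadOnceBags {
        long exec(Iterable<Long> bag) {
            long total = 0;
            for (long value : bag) {
                total += value;
            }
            return total;
        }
    }

    // The kind of check a planner could make before choosing a read-once bag.
    static boolean canUseReadOnceBag(Object udf) {
        return udf instanceof UsesReadOnceBags;
    }

    public static void main(String[] args) {
        // Stand-in for the values iterator handed to a reduce call.
        Iterator<Long> reduceValues = Arrays.asList(1L, 2L, 3L).iterator();
        StreamingSum udf = new StreamingSum();
        System.out.println(canUseReadOnceBag(udf));                        // prints true
        System.out.println(udf.exec(new ReadOnceBag<>(reduceValues)));     // prints 6
    }
}
```

Failing loudly on a second iterator() call is one possible design choice; it surfaces a UDF that was tagged incorrectly instead of silently returning an exhausted iterator.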

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.