PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop 
values iterator)

                 Key: PIG-807
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.2.1
            Reporter: Pradeep Kamath
             Fix For: 0.3.0

Currently, all bags resulting from a group or cogroup are materialized as bags 
containing all of the contents. The issue with this is that if a particular key 
has many corresponding values, all of these values get stuffed into a bag, which 
may run out of memory and hence spill, causing a slowdown in performance and 
sometimes memory exceptions. In many cases, the UDFs that use the bags coming 
out of a group or cogroup only need to iterate over the bag in a unidirectional, 
read-once manner. This can be implemented by having the bag implement its 
iterator by simply iterating over the underlying Hadoop iterator provided in 
the reduce. This kind of bag is also needed elsewhere, so the code can be reused 
for this issue too. The other part of this issue is to have some way for the UDFs 
to communicate to Pig that any input bags they need are "read once" bags. 
This can be achieved by having an interface, say "UsesReadOnceBags", which 
serves as a tag to indicate the intent to Pig. Pig can then rewire its 
execution plan to use ReadOnceBags if feasible.
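
The two parts of the proposal could be sketched roughly as follows. This is a 
minimal illustration in plain Java, not Pig code: only the name 
UsesReadOnceBags comes from the description above, while ReadOnceBag, 
CountValues, ReadOnceBagDemo and canUseReadOnceBag are hypothetical names, and 
a plain java.util.Iterator stands in for the Hadoop values iterator.

```java
import java.util.Arrays;
import java.util.Iterator;

// Marker interface named in the proposal; it declares no methods and only
// tags a UDF as needing a single forward pass over its input bags.
interface UsesReadOnceBags {}

// Hypothetical stand-in for a bag; this is NOT Pig's DataBag API.
final class ReadOnceBag<T> implements Iterable<T> {
    private final Iterator<T> source; // e.g. the reduce-side values iterator
    private boolean handedOut = false;

    ReadOnceBag(Iterator<T> source) {
        this.source = source;
    }

    // Hands out the underlying iterator directly, so nothing is buffered.
    // A second call fails because the values cannot be replayed.
    @Override
    public Iterator<T> iterator() {
        if (handedOut) {
            throw new IllegalStateException("read-once bag already iterated");
        }
        handedOut = true;
        return source;
    }
}

// Hypothetical UDF that only needs one pass, so it carries the tag.
final class CountValues implements UsesReadOnceBags {
    long exec(Iterable<?> bag) {
        long n = 0;
        for (Object ignored : bag) {
            n++;
        }
        return n;
    }
}

final class ReadOnceBagDemo {
    // Hypothetical planner check: use a read-once bag only for tagged UDFs.
    static boolean canUseReadOnceBag(Object udf) {
        return udf instanceof UsesReadOnceBags;
    }

    public static void main(String[] args) {
        Iterator<String> values = Arrays.asList("a", "b", "c").iterator();
        CountValues udf = new CountValues();
        if (canUseReadOnceBag(udf)) {
            System.out.println(udf.exec(new ReadOnceBag<>(values)));
        }
    }
}
```

Failing on the second call to iterator() is one possible design choice: it 
makes accidental re-reads visible instead of silently returning an exhausted 
iterator. UDFs without the tag would keep receiving fully materialized bags.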
