[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-807:
-------------------------------
Fix Version/s: 0.6.0
Assignee: Ying He
> PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the
> Hadoop values iterator)
> ------------------------------------------------------------------------------------------------
>
> Key: PIG-807
> URL: https://issues.apache.org/jira/browse/PIG-807
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.1
> Reporter: Pradeep Kamath
> Assignee: Ying He
> Fix For: 0.6.0
>
>
> Currently, all bags resulting from a group or cogroup are materialized as bags
> containing all of their contents. The problem is that if a particular key has
> many corresponding values, all of these values get stuffed into a bag, which
> may run out of memory and hence spill, causing a slowdown in performance and
> sometimes memory exceptions. In many cases, the UDFs that use the bags coming
> out of a group or cogroup only need to iterate over the bag in a
> unidirectional, read-once manner. This can be implemented by having the bag
> implement its iterator by simply iterating over the underlying Hadoop values
> iterator provided in the reduce. This kind of bag is also needed in
> http://issues.apache.org/jira/browse/PIG-802. So the code can be reused for
> this issue too. The other part of this issue is to have some way for the UDFs
> to communicate to Pig that any input bags they need are "read once" bags.
> This can be achieved with an interface, say "UsesReadOnceBags", which
> serves as a tag to indicate the intent to Pig. Pig can then rewire its
> execution plan to use ReadOnceBags if feasible.
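
The two parts of the proposal (a bag whose iterator is the reduce-side values iterator, plus a tag interface for UDFs) could look roughly like the sketch below. This is only an illustration under assumptions, not Pig code: ReadOnceBag, CountValues and ReadOnceBagDemo are hypothetical names, UsesReadOnceBags is taken from the description but its shape here is a guess, and a plain java.util.Iterator stands in for the Hadoop values iterator.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical marker interface, named after the proposal; it has no methods
// and only signals that a UDF reads its input bags once, front to back.
interface UsesReadOnceBags {}

// Hypothetical read-once bag: wraps a single-pass iterator (standing in for
// the Hadoop reduce values iterator) without copying its contents.
final class ReadOnceBag<T> implements Iterable<T> {
    private final Iterator<T> source;
    private boolean consumed = false;

    ReadOnceBag(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        // The underlying stream cannot be rewound, so a second pass is refused.
        if (consumed) {
            throw new IllegalStateException("read-once bag was already iterated");
        }
        consumed = true;
        return source;
    }
}

// Example UDF-like class that only needs one forward pass over its input.
final class CountValues implements UsesReadOnceBags {
    long exec(Iterable<?> bag) {
        long n = 0;
        for (Object ignored : bag) {
            n++;
        }
        return n;
    }
}

final class ReadOnceBagDemo {
    public static void main(String[] args) {
        CountValues udf = new CountValues();
        // A planner could check for the marker before choosing the bag type.
        boolean streamable = udf instanceof UsesReadOnceBags;
        ReadOnceBag<String> bag =
                new ReadOnceBag<>(Arrays.asList("a", "b", "c").iterator());
        System.out.println(streamable + " " + udf.exec(bag));
    }
}
```

Because nothing is buffered, memory use stays flat no matter how many values a key has; the trade-off is that a UDF needing a second pass cannot use such a bag, which is why the sketch throws on re-iteration.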
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.