[ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-30:
--------------------------
Attachment: bagrewrite.patch
The attached patch file contains a rewrite of DataBag in line with the proposal
given in previous comments. Highlights include:
* DataBag has been entirely rewritten. As part of this, the interface has
been brought into line with the standard Java container interfaces (size()
instead of cardinality() and iterator() instead of content()). cardinality()
and content() have been kept for backward compatibility but are marked as
deprecated. DataBag has also become an abstract class, and the functionality
to sort a bag and to apply distinct to it has been removed from DataBag itself.
This functionality is now provided by subclasses instead.
* BigDataBag has been removed. All data bags can now spill to disk when
necessary.
* DefaultDataBag, SortedDataBag, and DistinctDataBag have been added. Each
of these extends DataBag.
* BagFactory has been entirely rewritten. As part of this, its interface
has been changed in a non-backward-compatible way. The caller must now specify
up front what type of bag (default, sorted, distinct) is desired, and the
appropriate type of bag will be provided. In making these changes I assumed
that users never call BagFactory directly, and thus that changing the interface
won't break any UDFs. If this assumption is wrong, please let me know.
* A Spillable interface has been added. An implementing class declares that
it can be asked by the system to spill its contents to disk. DataBag
implements Spillable.
* SpillableMemoryManager has been added (courtesy of Ben). This memory
manager registers with the JVM to be called when the largest memory pool
becomes more than 50% full. It then goes through its list of Spillable objects
and asks them to spill.
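
To illustrate the last two points, here is a minimal sketch of how a memory
manager of this kind can hook into the JVM. It was written for this note and is
not taken from bagrewrite.patch. The names SpillableSketch and
SpillManagerSketch are hypothetical, and the use of weak references and the
restriction to heap pools are assumptions of the sketch; only the
java.lang.management and javax.management calls are the standard JDK ones.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

/** Hypothetical stand-in for the Spillable interface described above. */
interface SpillableSketch {
    /** Moves in-memory contents to disk and returns the number of items spilled. */
    long spill();
}

/** Illustrative manager only; not the SpillableMemoryManager from the patch. */
class SpillManagerSketch implements NotificationListener {
    // Weak references (an assumption of this sketch) let unused bags be collected.
    private final List<WeakReference<SpillableSketch>> spillables = new ArrayList<>();

    /** Arms a usage threshold at 50% of the largest heap pool and listens for it. */
    void install() {
        MemoryPoolMXBean biggest = null;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!pool.isValid() || pool.getType() != MemoryType.HEAP
                    || !pool.isUsageThresholdSupported()) {
                continue;
            }
            if (biggest == null || pool.getUsage().getMax() > biggest.getUsage().getMax()) {
                biggest = pool;
            }
        }
        if (biggest == null || biggest.getUsage().getMax() <= 0) {
            return; // no suitable pool, or its maximum size is undefined (-1)
        }
        biggest.setUsageThreshold(biggest.getUsage().getMax() / 2);
        // The platform MemoryMXBean is documented to be a NotificationEmitter.
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(this, null, null);
    }

    /** Adds an object to the list that is asked to spill under memory pressure. */
    synchronized void register(SpillableSketch s) {
        spillables.add(new WeakReference<>(s));
    }

    /** Called by the JVM when a pool's usage crosses its threshold. */
    @Override
    public void handleNotification(Notification n, Object handback) {
        if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
            spillAll();
        }
    }

    /** Asks every live registered object to spill; returns the total spilled. */
    synchronized long spillAll() {
        long total = 0;
        Iterator<WeakReference<SpillableSketch>> it = spillables.iterator();
        while (it.hasNext()) {
            SpillableSketch s = it.next().get();
            if (s == null) {
                it.remove(); // the object was garbage collected; drop the entry
            } else {
                total += s.spill();
            }
        }
        return total;
    }
}
```

In this sketch a bag would call register() on itself when it is created, and
install() would be called once per JVM. What a real spill() writes, and where,
is left open here.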
> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
> Key: PIG-30
> URL: https://issues.apache.org/jira/browse/PIG-30
> Project: Pig
> Issue Type: Bug
> Components: data
> Reporter: Benjamin Reed
> Assignee: Alan Gates
> Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use
> BigDataBag. I think we already do this. The problem is that the logic in
> BigDataBag is hard to follow and it is made more complicated because it
> subclasses DataBag. We should merge these two classes together.