[jira] Updated: (PIG-30) Get rid of DataBag and always use BigDataBag

Alan Gates (JIRA) Thu, 03 Jan 2008 09:45:57 -0800

     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alan Gates updated PIG-30:
--------------------------

    Attachment: bagrewrite.patch

The attached patch file contains a rewrite of DataBag in line with the proposal 
given in previous comments.  Highlights include:

    * DataBag has been entirely rewritten.  As part of this, its interface has 
been brought into line with the standard Java container interfaces (size() 
instead of cardinality(), and iterator() instead of content()).  cardinality() 
and content() have been kept for backward compatibility but are marked as 
deprecated.  DataBag has also become an abstract class, and the functionality 
to sort a bag and to apply distinct to it has been removed from DataBag itself.  
This functionality is now provided by subclasses instead.

    * BigDataBag has been removed.  All data bags can now spill to disk when 
necessary.

    * DefaultDataBag, SortedDataBag, and DistinctDataBag have been added.  Each 
of these extends DataBag.

    * BagFactory has been entirely rewritten.  Its interface has changed in a 
way that is not backward compatible: the caller must now specify up front what 
type of bag (default, sorted, distinct) is desired, and the appropriate type of 
bag will be provided.  In making these changes I assumed that users never call 
BagFactory directly, and thus that changing the interface won't break any UDFs.  
If this assumption is wrong, please let me know.

    * The Spillable interface has been added.  A class that implements this 
interface can be asked by the system to spill its contents to disk.  DataBag 
implements Spillable.

    * SpillableMemoryManager has been added (courtesy of Ben).  This memory 
manager registers with the JVM to be called when the largest memory pool 
becomes more than 50% full.  It then goes through its list of Spillable objects 
and asks them to spill.
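
To make the spilling mechanism in the last two items concrete, here is a 
minimal sketch of how such a memory manager could be wired up with the standard 
java.lang.management API.  It is not taken from the patch: SpillableSketch, 
SpillManagerSketch, register() and spillAll() are hypothetical names, and 
picking the largest heap pool that supports a usage threshold is an assumption 
about what "largest memory pool" means.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Hypothetical shape of the interface; the method name in the patch may differ.
interface SpillableSketch {
    /** Move in-memory contents to disk and return the number of items spilled. */
    long spill();
}

// Hypothetical manager: asks every registered object to spill when the
// largest heap pool crosses half of its maximum size.
class SpillManagerSketch implements NotificationListener {
    private final List<SpillableSketch> spillables = new ArrayList<SpillableSketch>();

    SpillManagerSketch() {
        MemoryPoolMXBean biggest = null;
        long biggestMax = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP || !pool.isUsageThresholdSupported()) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            long max = usage.getMax(); // -1 when the maximum is undefined
            if (max > biggestMax) {
                biggest = pool;
                biggestMax = max;
            }
        }
        if (biggest != null) {
            // Request a notification once usage reaches 50% of the pool maximum.
            biggest.setUsageThreshold(biggestMax / 2);
            NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
            emitter.addNotificationListener(this, null, null);
        }
    }

    synchronized void register(SpillableSketch s) {
        spillables.add(s);
    }

    /** Ask every registered object to spill; returns the total number of items spilled. */
    synchronized long spillAll() {
        long total = 0;
        for (SpillableSketch s : spillables) {
            total += s.spill();
        }
        return total;
    }

    // Called by the JVM on its notification thread.
    public void handleNotification(Notification n, Object handback) {
        if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
            spillAll();
        }
    }
}
```

In this sketch a bag would call register() on itself when it is created.  The 
list holds strong references for simplicity, so a real manager would also need 
some way to drop bags that are no longer in use.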

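For readers who want a picture of the new class layout, the following is a 
rough, in-memory-only sketch of an abstract bag with the container-style 
methods, its three subclasses, and a factory that takes the bag type up front.  
All names here (BagSketch, BagKind, BagFactorySketch.newBag and so on) are 
hypothetical; the actual BagFactory signature is not shown in this message, and 
spilling is left out entirely.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical stand-in for the abstract DataBag.
abstract class BagSketch<T> implements Iterable<T> {
    protected final Collection<T> items;

    protected BagSketch(Collection<T> items) { this.items = items; }

    public void add(T t) { items.add(t); }

    // Container-style names.
    public long size() { return items.size(); }
    public Iterator<T> iterator() { return items.iterator(); }

    // Old names kept for backward compatibility.
    @Deprecated public long cardinality() { return size(); }
    @Deprecated public Iterator<T> content() { return iterator(); }
}

// Keeps items in insertion order, duplicates allowed.
class DefaultBagSketch<T> extends BagSketch<T> {
    DefaultBagSketch() { super(new ArrayList<T>()); }
}

// Keeps duplicates, but iterates in sorted order.
class SortedBagSketch<T extends Comparable<T>> extends BagSketch<T> {
    SortedBagSketch() { super(new ArrayList<T>()); }

    @Override public Iterator<T> iterator() {
        List<T> sorted = new ArrayList<T>(items);
        Collections.sort(sorted);
        return sorted.iterator();
    }
}

// Drops duplicates on insert.
class DistinctBagSketch<T> extends BagSketch<T> {
    DistinctBagSketch() { super(new LinkedHashSet<T>()); }
}

enum BagKind { DEFAULT, SORTED, DISTINCT }

// Hypothetical factory: the caller states the kind of bag up front.
class BagFactorySketch {
    static <T extends Comparable<T>> BagSketch<T> newBag(BagKind kind) {
        switch (kind) {
            case SORTED:   return new SortedBagSketch<T>();
            case DISTINCT: return new DistinctBagSketch<T>();
            default:       return new DefaultBagSketch<T>();
        }
    }
}
```

A caller would then write something like 
BagFactorySketch.newBag(BagKind.SORTED) rather than sorting a bag after the 
fact.
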
> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>         Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use 
> BigDataBag. I think we already do this. The problem is that the logic in 
> BigDataBag is hard to follow and it is made more complicated because it 
> subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
