[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Alan Gates (JIRA) Mon, 07 Jan 2008 08:30:55 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556616#action_12556616
 ]


Alan Gates commented on PIG-30:
-------------------------------

Responses to Utkarsh's comments:

0.  TreeSet.add() only adds an element if it is not already present (see 
http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html#add(E)), so the 
element already in the tree is never overwritten.  That's why, when the call 
returns false, the code goes back and reads the next element from the same 
file it read the last element from.  This ensures we keep reading from that 
file until either the file is exhausted or we find a new unique element to put 
in the TreeSet.

1.  Good catch, I'll add a hashCode() implementation for DataBag.

2.  They aren't quite as combinable as they first appear.  The code in next() 
is identical, and could be combined.  DistinctDataBag.readFromTree() and 
SortedDataBag.readFromPriorityQ() create different containers and access them 
differently.  I could put just the create and access methods in each and 
combine the rest of the logic.  The addToQueue() functions in each are 
different and have different logic about how to add an element to the queue.   
I can work on this, but it may be a bit before I get to it.
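
To illustrate point 0, here is a minimal sketch of the read-until-added idea. 
It is not the code from bagrewrite.patch: the class name DistinctMergeSketch, 
the Entry holder, and the refill() and mergeDistinct() helpers are all 
hypothetical, in-memory lists of strings stand in for the spill files, and it 
assumes each run is already sorted and free of duplicates within itself.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

class DistinctMergeSketch {
    // Hypothetical holder: a value plus the index of the run it was read from.
    static final class Entry implements Comparable<Entry> {
        final String value;
        final int source;

        Entry(String value, int source) {
            this.value = value;
            this.source = source;
        }

        // Ordering uses the value only, so two entries with the same value
        // count as the same element for the TreeSet, whatever their source.
        @Override
        public int compareTo(Entry other) {
            return value.compareTo(other.value);
        }
    }

    // Keep reading from one run until an element is actually added to the
    // tree or the run is exhausted.
    static void refill(TreeSet<Entry> tree, List<Iterator<String>> runs, int source) {
        Iterator<String> run = runs.get(source);
        while (run.hasNext()) {
            if (tree.add(new Entry(run.next(), source))) {
                return; // a new unique element went in
            }
            // add() returned false: an equal element is already in the tree
            // and was left untouched, so read the next one from this run.
        }
    }

    // Merge sorted, internally duplicate-free runs into one sorted,
    // duplicate-free list.
    static List<String> mergeDistinct(List<List<String>> sortedRuns) {
        List<Iterator<String>> runs = new ArrayList<Iterator<String>>();
        for (List<String> run : sortedRuns) {
            runs.add(run.iterator());
        }
        TreeSet<Entry> tree = new TreeSet<Entry>();
        for (int i = 0; i < runs.size(); i++) {
            refill(tree, runs, i);
        }
        List<String> out = new ArrayList<String>();
        while (!tree.isEmpty()) {
            Entry smallest = tree.pollFirst();
            out.add(smallest.value);
            // Replace the element just emitted with the next unique one
            // from the same run.
            refill(tree, runs, smallest.source);
        }
        return out;
    }
}
```

Because add() never replaces an existing element, the source index stored in 
the tree always belongs to the run whose element actually went in, which is 
what makes it safe to go back to that same run after emitting it.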

> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>         Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use 
> BigDataBag. I think we already do this. The problem is that the logic in 
> BigDataBag is hard to follow and it is made more complicated because it 
> subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
