[
https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556616#action_12556616
]
Alan Gates commented on PIG-30:
-------------------------------
Responses to Utkarsh's comments:
0. TreeSet.add() adds an element only if an equal element is not already
present (see
http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html#add(E)). This
guarantees that the element already in the tree will not be overwritten.
That's why, if that call returns false, the code goes back and reads the next
element from the same file the rejected element came from. This guarantees
that we keep reading from that file until either the file is exhausted or we
find a new unique element to put in the TreeSet.
1. Good catch, I'll add a hashCode() implementation for DataBag.
2. They aren't quite as combinable as they first appear. The code in next()
is identical, and could be combined. DistinctDataBag.readFromTree() and
SortedDataBag.readFromPriorityQ() create different containers and access them
differently. I could keep just the create and access methods in each class and
combine the rest of the logic. The addToQueue() functions in each are
different and have different logic about how to add an element to the queue.
I can work on this, but it may be a bit before I get to it.
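
For illustration, the reread-on-duplicate logic described in item 0 can be
sketched roughly as below. This is a minimal sketch and not the code in
bagrewrite.patch: the names DistinctMergeSketch, Entry, refill() and
mergeDistinct() are hypothetical, in-memory iterators stand in for the spill
files, and each source is assumed to be sorted and free of duplicates within
itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

class DistinctMergeSketch {
    /** A value paired with the index of the source it was read from. */
    static final class Entry implements Comparable<Entry> {
        final String value;
        final int source;

        Entry(String value, int source) {
            this.value = value;
            this.source = source;
        }

        // Compare on value only, so equal values from different sources
        // count as the same element for the TreeSet.
        public int compareTo(Entry other) {
            return value.compareTo(other.value);
        }
    }

    /**
     * Reads from one source until an element is accepted by the tree or the
     * source is exhausted. TreeSet.add() returns false, and leaves the tree
     * unchanged, when an equal element is already present.
     */
    static void refill(TreeSet<Entry> tree, List<Iterator<String>> sources, int source) {
        Iterator<String> it = sources.get(source);
        while (it.hasNext()) {
            if (tree.add(new Entry(it.next(), source))) {
                return; // found a new unique element
            }
            // Duplicate: loop around and read again from the same source.
        }
    }

    /** Merges sorted, duplicate-free sources into one sorted, distinct list. */
    static List<String> mergeDistinct(List<Iterator<String>> sources) {
        TreeSet<Entry> tree = new TreeSet<Entry>();
        for (int i = 0; i < sources.size(); i++) {
            refill(tree, sources, i);
        }
        List<String> out = new ArrayList<String>();
        while (!tree.isEmpty()) {
            Entry smallest = tree.pollFirst();
            out.add(smallest.value);
            // Replace the element just emitted with the next one from its source.
            refill(tree, sources, smallest.source);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<String>> sources = new ArrayList<Iterator<String>>();
        sources.add(Arrays.asList("a", "c", "e").iterator());
        sources.add(Arrays.asList("a", "b", "e", "f").iterator());
        System.out.println(mergeDistinct(sources)); // prints [a, b, c, e, f]
    }
}
```

Because a rejected element is never placed in the tree, the only way to keep
its source represented in the merge is to read from that source again, which
is what the loop in refill() does.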
> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
> Key: PIG-30
> URL: https://issues.apache.org/jira/browse/PIG-30
> Project: Pig
> Issue Type: Bug
> Components: data
> Reporter: Benjamin Reed
> Assignee: Alan Gates
> Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use
> BigDataBag. I think we already do this. The problem is that the logic in
> BigDataBag is hard to follow and it is made more complicated because it
> subclasses DataBag. We should merge these two classes together.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.