Ok, all sounds good.
On Jan 7, 2008, at 8:30 AM, Alan Gates (JIRA) wrote:
[ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556616#action_12556616 ]
Alan Gates commented on PIG-30:
-------------------------------
Responses to Utkarsh's comments:
0. TreeSet.add() only adds an element if it is not already present
(see http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html#add(E)).
This guarantees that the element already in the tree will not be
obliterated. That's why, if that call returns false, the code goes
back and reads again from the same file that supplied the rejected
element. This guarantees that we keep reading from that file until
either the file is empty or we find a new unique element to put in
the TreeSet.
1. Good catch, I'll add a hashCode() implementation for DataBag.
2. They aren't quite as combinable as they first appear. The code
in next() is identical and could be combined.
DistinctDataBag.readFromTree() and SortedDataBag.readFromPriorityQ()
create different containers and access them differently. I could
keep just the create and access methods in each class and combine
the rest of the logic. The addToQueue() functions in each are
different and have different logic about how to add an element to
the queue. I can work on this, but it may be a bit before I get
to it.
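To illustrate point 0, here is a minimal sketch of the reread-on-duplicate
loop. It is not the code from bagrewrite.patch: the class and method names
(DistinctMergeSketch, Entry, refill, mergeDistinct) are hypothetical, the
"files" are stood in for by in-memory sorted lists of integers, and each
run is assumed to be sorted and free of duplicates within itself. It also
uses pollFirst(), which is newer than the 1.5 API linked above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

class DistinctMergeSketch {
    // An element held in the tree, tagged with the run ("file") it came from.
    // Ordering looks only at the value, so equal values from different runs
    // count as the same element for TreeSet.add().
    static final class Entry implements Comparable<Entry> {
        final int value;
        final int source;
        Entry(int value, int source) { this.value = value; this.source = source; }
        @Override public int compareTo(Entry other) {
            return Integer.compare(value, other.value);
        }
    }

    // Keep reading from one run until add() accepts an element (returns true)
    // or the run is exhausted. A false return means the value is already in
    // the tree, so the existing entry is left alone and we read again.
    static void refill(TreeSet<Entry> tree, List<Iterator<Integer>> sources, int source) {
        Iterator<Integer> it = sources.get(source);
        while (it.hasNext()) {
            if (tree.add(new Entry(it.next(), source))) {
                return;
            }
        }
    }

    // Merge sorted, internally duplicate-free runs into one sorted list
    // of distinct values.
    static List<Integer> mergeDistinct(List<List<Integer>> sortedRuns) {
        List<Iterator<Integer>> sources = new ArrayList<>();
        for (List<Integer> run : sortedRuns) {
            sources.add(run.iterator());
        }
        TreeSet<Entry> tree = new TreeSet<>();
        for (int i = 0; i < sources.size(); i++) {
            refill(tree, sources, i);
        }
        List<Integer> out = new ArrayList<>();
        while (!tree.isEmpty()) {
            Entry smallest = tree.pollFirst();
            out.add(smallest.value);
            // Replace the element we just emitted from the run it came from.
            refill(tree, sources, smallest.source);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 3, 5),
            Arrays.asList(1, 2, 5),
            Arrays.asList(5, 6));
        System.out.println(mergeDistinct(runs)); // prints [1, 2, 3, 5, 6]
    }
}
```

The tree holds at most one element per run, and a run only drops out of
the tree once it has been read to the end, which is what keeps a value
from being emitted twice.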
Get rid of DataBag and always use BigDataBag
--------------------------------------------
Key: PIG-30
URL: https://issues.apache.org/jira/browse/PIG-30
Project: Pig
Issue Type: Bug
Components: data
Reporter: Benjamin Reed
Assignee: Alan Gates
Attachments: bagrewrite.patch
We should never use DataBag directly; instead, we should always
use BigDataBag. I think we already do this. The problem is that
the logic in BigDataBag is hard to follow and it is made more
complicated because it subclasses DataBag. We should merge these
two classes together.