[
https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556172#action_12556172
]
Utkarsh Srivastava commented on PIG-30:
---------------------------------------
Great job! This was a fairly large chunk of work.
A few more comments in the code would be nice. In particular, one implicit
assumption should be documented: bag behavior is undefined if you call add()
on a DataBag after opening an iterator(). Alan and I talked about this.
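As an analogy only, the standard Java collections show the same hazard. The sketch below uses an ArrayList as a stand-in for a bag; the class and method names are hypothetical and nothing here is Pig code. An ArrayList iterator happens to fail fast, whereas the bag in the patch makes no such promise, which is why the restriction needs to be written down.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class AddAfterIteratorSketch {
    /** Returns true if an iterator opened before an add() fails on its next read. */
    static boolean addAfterIteratorFails() {
        List<String> bag = new ArrayList<>();   // stand-in for a bag, not Pig's DataBag
        bag.add("t1");
        Iterator<String> it = bag.iterator();   // iterator opened here
        bag.add("t2");                          // structural change after opening it
        try {
            it.next();                          // ArrayList detects the change
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

A caller that finishes all add() calls before opening the iterator avoids the problem entirely.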
Other issues:
0. DistinctBag uses a TreeSet while merging files, but TContainer compares
only on tuple equality. Once you add a tuple equal to one already in the
TreeSet but from another input, one of the inputs will be dropped from the
TreeSet and never read again. Am I missing something?
1. DistinctBag uses a HashSet<>. For the hash set to work properly, hashCode()
must be implemented correctly. Since Tuple.hashCode() calls hashCode() on all
its fields, every Datum needs a hashCode(). DataBag doesn't have one, which
implies that DistinctBag won't work with nested data.
2. The spill() code in DistinctBag and SortedBag is the same, except that the
former always uses the default comparator whereas SortedBag might use a
specified comparator. Can we reuse the code instead of duplicating it?
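To make issue 0 concrete, here is a minimal sketch of a TreeSet-based merge of sorted inputs. It is not the code from the patch: Head is a hypothetical stand-in for TContainer, Strings stand in for tuples, and the two comparators are assumptions made for illustration. With a comparator that looks only at the tuple, TreeSet.add() rejects the second equal head and that input is never read again. Breaking ties on the input index is one possible way to keep both heads in the set.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class MergeSketch {
    /** Hypothetical stand-in for TContainer: the current head tuple of one sorted input. */
    static final class Head {
        final String tuple;          // a String stands in for a Tuple
        final int source;            // index of the input this tuple came from
        final Iterator<String> rest; // remaining tuples of that input

        Head(String tuple, int source, Iterator<String> rest) {
            this.tuple = tuple;
            this.source = source;
            this.rest = rest;
        }
    }

    /** Compares on the tuple alone, as the review comment describes. */
    static final Comparator<Head> BY_TUPLE_ONLY =
        Comparator.comparing((Head h) -> h.tuple);

    /** Assumed fix: equal tuples from different inputs no longer compare as equal. */
    static final Comparator<Head> BY_TUPLE_THEN_SOURCE =
        BY_TUPLE_ONLY.thenComparingInt(h -> h.source);

    /** Merges sorted inputs into a sorted, duplicate-free list. */
    static List<String> mergeDistinct(List<List<String>> inputs, Comparator<Head> order) {
        TreeSet<Head> heads = new TreeSet<>(order);
        for (int i = 0; i < inputs.size(); i++) {
            Iterator<String> it = inputs.get(i).iterator();
            // add() returns false and drops the head if an "equal" one is present
            if (it.hasNext()) heads.add(new Head(it.next(), i, it));
        }
        List<String> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            Head h = heads.pollFirst();
            // emit the tuple unless it repeats the previously emitted one
            if (out.isEmpty() || !out.get(out.size() - 1).equals(h.tuple)) out.add(h.tuple);
            // refill from the same input the polled head came from
            if (h.rest.hasNext()) heads.add(new Head(h.rest.next(), h.source, h.rest));
        }
        return out;
    }
}
```

For the inputs [a, c] and [a, b], the tuple-only comparator loses the second input after its first element, so b never appears in the output. The tie-breaking comparator reads both inputs to the end.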
> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
> Key: PIG-30
> URL: https://issues.apache.org/jira/browse/PIG-30
> Project: Pig
> Issue Type: Bug
> Components: data
> Reporter: Benjamin Reed
> Assignee: Alan Gates
> Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use
> BigDataBag. I think we already do this. The problem is that the logic in
> BigDataBag is hard to follow and it is made more complicated because it
> subclasses DataBag. We should merge these two classes together.