[jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag

Utkarsh Srivastava (JIRA) Fri, 04 Jan 2008 20:44:55 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556172#action_12556172
 ]


Utkarsh Srivastava commented on PIG-30:
---------------------------------------

Great job! This was a fairly large chunk of work.

It would be nice to have a few more comments. Specifically, one point that is 
currently implicit is that bag behavior is undefined if you add() to a DataBag 
after opening an iterator(). Alan and I talked about this.

Other issues:

0. DistinctBag uses a TreeSet while merging files, but TContainer compares 
based only on tuple equality. Once you add a tuple equal to one already in the 
TreeSet but from another input, the add is a no-op, so one of the inputs gets 
eliminated from the TreeSet and is never read again. Am I missing something?

1. HashSet<> in DistinctBag. For a hash set to work properly, we need the 
hashCode() methods to work properly. Since Tuple.hashCode() calls hashCode() 
on all its fields, all Datums should have a hash code. DataBag doesn't have 
one, which implies that DistinctBag won't work with nested data.

2. The spill() code in DistinctBag and SortedBag is the same, except that the 
former always uses the default comparator whereas SortedBag might use a 
specified comparator. Can we reuse the code instead of duplicating it?
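
To make issue 0 concrete, here is a minimal standalone sketch. It is not the 
code from the patch: MergeSketch, Entry (a stand-in for TContainer), merge(), 
and both comparators are hypothetical names, and plain strings stand in for 
tuples. It assumes a Java 8 or later compiler. It merges sorted inputs through 
a TreeSet of per-input heads, first with a comparator that looks only at the 
value and then with one that breaks ties on the input index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class MergeSketch {
    // Hypothetical stand-in for TContainer: the pending value from one
    // input, plus the index of the input it was read from.
    static final class Entry {
        final String value;
        final int source;
        Entry(String value, int source) { this.value = value; this.source = source; }
    }

    // Compares on the value only, as described in issue 0.
    static final Comparator<Entry> BY_VALUE =
        Comparator.comparing((Entry e) -> e.value);

    // Same ordering, but ties are broken on the input index, so equal
    // values from different inputs can coexist in the TreeSet.
    static final Comparator<Entry> BY_VALUE_THEN_SOURCE =
        BY_VALUE.thenComparingInt(e -> e.source);

    // Merges sorted inputs and emits each distinct value once.
    static List<String> merge(List<List<String>> inputs, Comparator<Entry> order) {
        List<Iterator<String>> iterators = new ArrayList<>();
        TreeSet<Entry> heads = new TreeSet<>(order);
        for (int i = 0; i < inputs.size(); i++) {
            Iterator<String> it = inputs.get(i).iterator();
            iterators.add(it);
            // TreeSet.add() is a no-op if the comparator reports an equal
            // element; the input is then never refilled.
            if (it.hasNext()) heads.add(new Entry(it.next(), i));
        }
        List<String> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            Entry smallest = heads.pollFirst();
            // Skip values equal to the one emitted last (distinct output).
            if (out.isEmpty() || !out.get(out.size() - 1).equals(smallest.value)) {
                out.add(smallest.value);
            }
            // Refill from the input that the polled entry came from.
            Iterator<String> it = iterators.get(smallest.source);
            if (it.hasNext()) heads.add(new Entry(it.next(), smallest.source));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("a", "c"));
        System.out.println(merge(inputs, BY_VALUE));             // prints [a, b]
        System.out.println(merge(inputs, BY_VALUE_THEN_SOURCE)); // prints [a, b, c]
    }
}
```

With the value-only comparator, the second input's "a" is rejected by the 
TreeSet, so that input is never advanced and "c" is lost. Breaking ties on the 
input index is one possible fix; another would be to advance the losing input 
whenever add() returns false.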



> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>         Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use 
> BigDataBag. I think we already do this. The problem is that the logic in 
> BigDataBag is hard to follow and it is made more complicated because it 
> subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
