[ https://issues.apache.org/jira/browse/PIG-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates reassigned PIG-342: ------------------------------ Assignee: Brandon Dimcheff > Size of DistinctDataBag is calculated incorrectly if spill occurs and > non-distinct elements are inserted > -------------------------------------------------------------------------------------------------------- > > Key: PIG-342 > URL: https://issues.apache.org/jira/browse/PIG-342 > Project: Pig > Issue Type: Bug > Affects Versions: 0.1.0 > Reporter: Brandon Dimcheff > Assignee: Brandon Dimcheff > Fix For: 0.1.0 > > Attachments: size.patch > > > If a spill occurs while elements are being inserted into a DistinctDataBag, > it's possible that non-unique items will be added to the in-memory data > structure, and the mSize counter will be incremented. If the same elements > also exist on disk, the count will be higher than it should be. > The following is copied from an email exchange I had with Alan Gates: > Alan, > Thanks for your help. I've done a bit more experimentation and have > discovered a couple more things. I first looked at how COUNT was > implemented. It looks like COUNT calls size() on the bag, which will return > mSize. I thought that mSize might be calculated improperly so I added > "SUM(unique_ids) AS crazy_userid_sum" to my GENERATE line and re-ran the > pigfile: > GENERATE FLATTEN(group), SUM(nice_data.duration) AS total_duration, > COUNT(nice_data) AS channel_switches, COUNT(unique_ids) AS unique_users, > SUM(unique_ids) AS crazy_userid_sum; > It turns out that the SUM generates the correct result in all cases, while > there are still occasional errors in the COUNT. Since SUM requires an > iteration over all the elements in the DistinctDataBag, this led me to > believe that the uniqueness constraint is indeed operating correctly, but > there is some error in the logic that calculates mSize. > Then I started poking around in DistinctDataBag looking for anything that > changes mSize that might be incorrect. I noticed that on line 87 in > addAll(), the size of the DataBag that is passed into the method is added to > the mSize instance variable, and then during the iteration a few lines later > mSize is being incremented when an element is successfully added to > mContents. I thought this might be the problem, since it seems like elements > would be double counted if addAll() was called. I commented out line 87, > recompiled Pig, and ran it again, but there are still errors (though I do > think line 87 might be incorrect anyways). > Thanks to my coworker Marshall, I think we may have discovered what the > actual problem is. The scenario is as follows: We're adding a bunch of > stuff to the bag, and before we're finished a spill occurs. mContents is > cleared during the spill (line 157). All add() does is check uniqueness > against mContents. So now we will get duplicates in mContents that are > already on disk and an inflated mSize. Now, the reason why SUM works is > because the iterator is smart and enforces uniqueness as it reads the records > back in. We think this occurs at the beginning of addToQueue, around line 363 > - 369. mMergeTree is a TreeSet, so it'll enforce uniqueness and the call to > addToQueue is aborted if there's already a matching record in mMergeTree. > Do you think our assessment is correct? If so, it seems that the calculation > of mSize needs to be significantly more complex than it is now. It looks to > me like the entire bag will need to be iterated in order to reliably > calculate the size. Do you have any ideas about how to implement this in a > less expensive way? I'd be happy to take a stab at it, but I don't want to > do anything particularly silly if you have a better idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.