Size of DistinctDataBag is calculated incorrectly if spill occurs and
non-distinct elements are inserted
--------------------------------------------------------------------------------------------------------
Key: PIG-342
URL: https://issues.apache.org/jira/browse/PIG-342
Project: Pig
Issue Type: Bug
Affects Versions: 0.1.0
Reporter: Brandon Dimcheff
Attachments: size.patch
If a spill occurs while elements are being inserted into a DistinctDataBag,
it's possible that non-unique items will be added to the in-memory data
structure, and the mSize counter will be incremented. If the same elements
also exist on disk, the count will be higher than it should be.
The following is copied from an email exchange I had with Alan Gates:
Alan,
Thanks for your help. I've done a bit more experimentation and have discovered
a couple more things. I first looked at how COUNT was implemented. It looks
like COUNT calls size() on the bag, which will return mSize. I thought that
mSize might be calculated improperly so I added "SUM(unique_ids) AS
crazy_userid_sum" to my GENERATE line and re-ran the pigfile:
GENERATE FLATTEN(group), SUM(nice_data.duration) AS total_duration,
COUNT(nice_data) AS channel_switches, COUNT(unique_ids) AS unique_users,
SUM(unique_ids) AS crazy_userid_sum;
It turns out that the SUM generates the correct result in all cases, while
there are still occasional errors in the COUNT. Since SUM requires an
iteration over all the elements in the DistinctDataBag, this led me to believe
that the uniqueness constraint is indeed operating correctly, but there is some
error in the logic that calculates mSize.
Then I started poking around in DistinctDataBag looking for anything that
changes mSize that might be incorrect. I noticed that on line 87 in addAll(),
the size of the DataBag that is passed into the method is added to the mSize
instance variable, and then during the iteration a few lines later mSize is
being incremented when an element is successfully added to mContents. I
thought this might be the problem, since it seems like elements would be double
counted if addAll() was called. I commented out line 87, recompiled Pig, and
ran it again, but there are still errors (though I do think line 87 might be
incorrect anyways).
Thanks to my coworker Marshall, I think we may have discovered what the actual
problem is. The scenario is as follows: We're adding a bunch of stuff to the
bag, and before we're finished a spill occurs. mContents is cleared during the
spill (line 157). All add() does is check uniqueness against mContents. So
now we will get duplicates in mContents that are already on disk and an
inflated mSize. Now, the reason why SUM works is because the iterator is smart
and enforces uniqueness as it reads the records back in. We think this occurs
at the beginning of addToQueue, around line 363 - 369. mMergeTree is a
TreeSet, so it'll enforce uniqueness and the call to addToQueue is aborted if
there's already a matching record in mMergeTree.
Do you think our assessment is correct? If so, it seems that the calculation
of mSize needs to be significantly more complex than it is now. It looks to me
like the entire bag will need to be iterated in order to reliably calculate the
size. Do you have any ideas about how to implement this in a less expensive
way? I'd be happy to take a stab at it, but I don't want to do anything
particularly silly if you have a better idea.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.