pig-user  

Incorrect DISTINCT Results

Brandon Dimcheff
Thu, 24 Jul 2008 10:53:54 -0700

Hello,

I'm attempting to run a Pig job on a Hadoop cluster with a 5GB/35 million row input. When run on sample data of 100k rows, I get the correct results, but when I run it on the whole dataset, some of the distinct counts are incorrect. The pigfile (field names and input schema changed slightly to protect the innocent) is below:

register pigutil.jar
raw_data = LOAD '/input/user_sessions.tsv' AS (userid,timestamp,location,duration);
nice_data = FOREACH raw_data GENERATE userid, pigutil.DateFromUnixTimestamp(timestamp) as date, location, duration;
report = FOREACH (GROUP nice_data BY (date,location)) {
        unique_ids = DISTINCT nice_data.userid;
        GENERATE FLATTEN(group), SUM(nice_data.duration) AS total_duration, COUNT(nice_data) AS hits, COUNT(unique_ids) AS unique_users;
}
STORE report into '/output/user_statistics';
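
For cross-checking a small sample outside of Pig, the same report can be sketched in plain Python. This is only a rough reference, not part of the job: the function name `report` is made up for illustration, and it assumes that pigutil.DateFromUnixTimestamp turns a Unix timestamp in seconds into a UTC calendar date, which may not match the real UDF.

```python
from collections import defaultdict
from datetime import datetime, timezone

def report(lines):
    """Return {(date, location): (total_duration, hits, unique_users)}."""
    durations = defaultdict(float)
    hits = defaultdict(int)
    users = defaultdict(set)
    for line in lines:
        userid, timestamp, location, duration = line.rstrip("\n").split("\t")
        # Assumption: the UDF maps a Unix timestamp (seconds) to a UTC date.
        date = datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date().isoformat()
        key = (date, location)
        durations[key] += float(duration)
        hits[key] += 1
        users[key].add(userid)  # a set keeps each userid once per group
    return {k: (durations[k], hits[k], len(users[k])) for k in hits}

# Made-up sample rows: userid, timestamp, location, duration
sample = [
    "u1\t1216900000\tNYC\t10\n",
    "u1\t1216900500\tNYC\t5\n",
    "u2\t1216901000\tNYC\t7\n",
]
result = report(sample)
```

On the three made-up rows above, the single group has 3 hits but only 2 unique users, which is the distinction the nested DISTINCT is meant to capture.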

Some info about the data and the errors:

* ~540 result rows total
* total_duration and hits are calculated properly
* There are generally between 5 and 10 incorrect results for unique_users per run
* In all cases, the number of unique users reported is greater than the correct number
* Each run produces errors for different rows, and no two runs have produced exactly the same incorrect data
* There are a lot of log messages from SpillableMemoryManager in the reduce phase about low memory handlers being called (both collection and usage thresholds being exceeded)
* The error rate *seems* to decrease with an increase in memory (and increase with a decrease in memory). We don't have enough data samples to be sure that this is the case. The memory was set using mapred.child.java.opts.

From what I can tell from my (very limited) knowledge of Pig's codebase, the problem might be occurring somewhere in DistinctDataBag. Perhaps the uniqueness constraint gets lost in the spilling logic. I'm not really sure where to go from here, since the code in DistinctDataBag is rather complex. Has anyone else had these problems? Is there someone familiar with the DistinctDataBag code I could work with to track down what's causing these errors?
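
To make the suspicion concrete, here is a toy Python sketch of how a spill-based distinct could overcount. It is not Pig's actual DistinctDataBag code; `spill_distinct`, `naive_count`, `merged_count` and the `max_in_memory` limit are all invented for illustration. The assumption is that each spill is duplicate-free on its own, so duplicates survive only if the spills are not de-duplicated against each other when they are read back.

```python
import heapq

def spill_distinct(values, max_in_memory):
    """Collect values into sorted runs ("spills"), each duplicate-free on its own."""
    runs, current = [], set()
    for v in values:
        current.add(v)
        if len(current) >= max_in_memory:  # simulated low-memory spill
            runs.append(sorted(current))
            current = set()
    if current:
        runs.append(sorted(current))
    return runs

def naive_count(runs):
    # Hypothetical bug: a value that landed in two runs is counted twice.
    return sum(len(r) for r in runs)

def merged_count(runs):
    # Merge the sorted runs and skip values equal to the previous one.
    count, previous = 0, object()
    for v in heapq.merge(*runs):
        if v != previous:
            count += 1
            previous = v
    return count

userids = ["u1", "u2", "u3", "u1", "u2", "u4"]
runs = spill_distinct(userids, max_in_memory=3)
print(naive_count(runs), merged_count(runs))  # prints "6 4"
```

In this toy model the count is only ever too high, and it is correct when everything fits in memory (with `max_in_memory=100` both functions return 4). That would be consistent with the symptoms listed above, but it is only a guess at the mechanism.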

Thanks,
Brandon