I have been looking into a pretty nasty bug. I haven't been able to reproduce it outside of our dataset yet (I need to do more work on making that happen). Prepare to enter crazytown. This bug exists on pig8 and pig9, and it happens about 50% of the time on both. It ALWAYS affects the same key, though the partition the key is sent to varies.
I have a flow that looks like this:

    x_and_y = foreach somedata generate source, sink;
    x_and_y_grouped = group x_and_y by sink;
    x_and_y_foreach = foreach x_and_y_grouped generate group as key, COUNT(x_and_y) as ct, x_and_y.source;
    store x_and_y_foreach into 'full';
    x_and_y_pared_down = foreach x_and_y_foreach generate key, ct;
    store x_and_y_pared_down into 'pared_down';
    x_and_y_foreach_all = group x_and_y_foreach all;
    x_and_y_foreach_stat = foreach x_and_y_foreach_all generate MAX(x_and_y_foreach.key) as max_key, COUNT(x_and_y_foreach) as count, SUM(x_and_y_foreach.ct) as sum;
    store x_and_y_foreach_stat into 'sum';

Ok, here is where things get crazy: ~50% of the time, 'pared_down' will have more rows than 'full'. Yeah. And x_and_y_foreach_stat will have the wrong count. Looking at the output files, there is a key that appears in one output part file as (key,correct_count) and in another output part file as (key,).

I have tried many things to see what could cause this: I turned off all optimizations, turned off speculative execution, and turned off multiquery optimization. I did all of this on both pig8 and pig9, and got the same error.

More crazy: if we do the exact same thing but group on source instead of sink, we haven't hit the error yet.

Anyone have any ideas what this may be related to? Seen anything similar? I'm going to try to reproduce it on a non-proprietary data set, but given that nobody has complained about this before, I imagine it's a really weird corner case somewhere.

Thanks
Jon
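P.S. In case anyone wants to poke at this themselves, here is a rough sketch of the sanity check I've been running to surface the offending key. It assumes the outputs were stored with the default tab-delimited PigStorage; the aliases and paths are illustrative, not our actual ones.

    -- load the two stored outputs back in (default PigStorage delimiter is tab);
    -- for 'full' we only need the first two fields, so the sources bag is dropped
    full_out = load 'full' using PigStorage() as (key:chararray, ct:long);
    pared_out = load 'pared_down' using PigStorage() as (key:chararray, ct:long);

    -- count how many rows each key contributes to each output
    full_g = group full_out by key;
    full_n = foreach full_g generate group as key, COUNT(full_out) as n_full;
    pared_g = group pared_out by key;
    pared_n = foreach pared_g generate group as key, COUNT(pared_out) as n_pared;

    -- a key whose row counts differ between the two outputs is the bad one,
    -- e.g. the key that shows up once in 'full' but twice in 'pared_down'
    joined = join full_n by key, pared_n by key;
    bad = filter joined by full_n::n_full != pared_n::n_pared;
    dump bad;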
