[ https://issues.apache.org/jira/browse/PIG-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673069#action_12673069 ]

Daniel Lescohier commented on PIG-672:
--------------------------------------

I tested with trunk 743881, and the problem is solved.  I recreated the hashes
by streaming through sha1.py and then ran the dupe-checking query.  The output
was 0 bytes.

set job.name 'title hash 743881';
DEFINE Cmd `sha1.py` ship('sha1.py');
row = load '/home/danl/url_title/unique_titles';
hashes = stream row through Cmd;
store hashes into '/home/danl/url_title/title_hash.743881';

set job.name 'h40.011.nh';
hash = load '/home/danl/url_title/title_hash.743881';
grouped = group hash by $0 parallel 7;
counted = foreach grouped generate group, COUNT(hash) as cnt;
having = filter counted by cnt > 1;
store having into '/home/danl/url_title/title_hash_collisions/h40.743881';
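
As a cross-check outside Hadoop (along the lines of the sort/uniq verification
in the original report), something like the following could scan the output for
duplicate hashes.  This is only a rough Python sketch; it assumes the part-*
files under title_hash.743881 have first been copied out of HDFS into a local
directory of the same (hypothetical) name.

#!/usr/bin/env python
# Rough local duplicate check over the regenerated hash output.
# Assumes the part-* files were copied out of HDFS into ./title_hash.743881.
import glob
import sys

seen = set()
dupes = 0
for path in sorted(glob.glob('title_hash.743881/part-*')):
    for line in open(path):
        h = line.rstrip('\n')
        if h in seen:
            dupes += 1
            sys.stderr.write('duplicate: %s\n' % h)
        else:
            seen.add(h)
sys.stdout.write('unique: %d  duplicates: %d\n' % (len(seen), dupes))

Holding all 174856784 hashes in a set needs a lot of memory, so for the full
data set the external sort | uniq -d route from the report is more practical;
the sketch is really only meant for a sample of the part files.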


I thought of a possible unit test: run an input file through the following
script and check whether the output file differs at all from the input file:

I = load 'seq.in.txt';
CAT = stream I through `cat`;
store CAT into 'seq.out.txt';

I tried this in pig -x local mode and it never had a problem, but I realize now
from PIG-645 that the bug only occurred in mapreduce mode.
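
A minimal sketch of the diff step, assuming the file names from the script
above and that in mapreduce mode 'seq.out.txt' ends up as a directory of part-*
files.  Since line order across part files isn't guaranteed, it compares the
input and output as multisets of lines rather than doing a literal diff:

#!/usr/bin/env python
# Compare input and output of the identity test as multisets of lines.
# File names follow the test script above; adjust as needed.
import glob
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in open('seq.in.txt'):
    counts[line] += 1
for path in glob.glob('seq.out.txt/part-*'):
    for line in open(path):
        counts[line] -= 1

mismatched = [l for l, c in counts.items() if c != 0]
if mismatched:
    sys.stderr.write('%d lines differ between input and output\n'
                     % len(mismatched))
    sys.exit(1)
sys.stdout.write('output matches input\n')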


> bad data output from STREAM operator in trunk (regression from 0.1.1)
> ---------------------------------------------------------------------
>
>                 Key: PIG-672
>                 URL: https://issues.apache.org/jira/browse/PIG-672
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>         Environment: Red Hat Enterprise Linux 4 & 5
> Hadoop 0.18.2
>            Reporter: Daniel Lescohier
>            Priority: Critical
>
> In the 0.1.1 release of pig, all of the following works fine; the problem is 
> in the trunk version.  Here's a brief intro to the workflow (details below):
>  * I have 174856784 lines of input data, each line is a unique title string.
>  * I stream the data through `sha1.py`, which outputs a sha1 hash of each 
> input line: a string of 40 hexadecimal digits.
>  * I group on the hash, generating a count of each group, then filter on rows 
> having a count > 1.
>  * With pig 0.1.1, it outputs all 0-byte part-* files, because all the hashes 
> are unique.
>  * I've also verified totally outside of Hadoop, using sort and uniq, that 
> the hashes are unique.
>  * A pig trunk checkout with "last changed rev 737863" returns non-empty 
> results; the 7 part-* files are 1.5MB each.
>  * I've tracked it down to the STREAM operation (details below).
> Here's the pig-svn-trunk job that produces the hashes:
> set job.name 'title hash';
> DEFINE Cmd `sha1.py` ship('sha1.py');
> row = load '/home/danl/url_title/unique_titles';
> hashes = stream row through Cmd;
> store hashes into '/home/danl/url_title/title_hash';
> Here's the pig-0.1.1 job that produces the hashes:
> set job.name 'title hash 011';
> DEFINE Cmd `sha1.py` ship('sha1.py');
> row = load '/home/danl/url_title/unique_titles';
> hashes = stream row through Cmd;
> store hashes into '/home/danl/url_title/title_hash.011';
> Here's sha1.py:
> #!/opt/cnet-python/default-2.5/bin/python
> from sys import stdin, stdout
> from hashlib import sha1
> for line in stdin:
>     h = sha1()
>     h.update(line[:-1])
>     stdout.write("%s\n" % h.hexdigest())
> Here's the pig-svn-trunk job for finding duplicate hashes in the hash data 
> generated by pig-svn-trunk:
> set job.name 'h40';
> hash = load '/home/danl/url_title/title_hash';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40 are 
> 1.5MB each.
> Here's the pig-0.1.1 job for finding duplicate hashes in the hash data 
> generated by pig-0.1.1:
> set job.name 'h40.011.nh';
> hash = load '/home/danl/url_title/title_hash.011';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.011.nh';
> The seven part-* files in 
> /home/danl/url_title/title_hash_collisions/h40.011.nh are 0KB each.
> Here's the pig-0.1.1 job for finding duplicate hashes in the hash data 
> generated by pig-svn-trunk:
> set job.name 'h40.011';
> hash = load '/home/danl/url_title/title_hash';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.011';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011 
> are 1.5MB each.
> Therefore, it's the hash data generated by pig-svn-trunk 
> (/home/danl/url_title/title_hash) which has duplicates in it.
> Here are the first six lines of /home/danl/url_title/title_hash/part-00064.  
> You can see that lines four and five are duplicates.  It looks like the stream 
> operator read the same line twice from the Python program.  The job which 
> produces the hashes is a map-only job, with no reduces.
> 8f3513136b1c8b87b8b73b9d39d96555095e9cdd
> 2edb20c5a3862cc5f545ae649f1e26430a38bda4
> ca9c216629fce16b4c113c0d9fcf65f906ab5e04
> 03fe80633822215a6935bcf95305bb14adf23f18
> 03fe80633822215a6935bcf95305bb14adf23f18
> 6d324b2cd1c52f564e2a29fcbf6ae2fb83d2697c
> After narrowing it down to the stream operator in pig-svn-trunk, I decided to 
> run the find-dupes job again using pig-svn-trunk, but this time piping the 
> data through cat first.  Cat shouldn't change the data at all; it's an 
> identity operation.  Here's the job:
> set job.name 'h40.cat';
> DEFINE Cmd `cat`;
> row = load '/home/danl/url_title/title_hash';
> hash = stream row through Cmd;
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.cat';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.cat 
> are 7.2MB each.  This 'h40.cat' job should produce the same results as the 
> 'h40' job.  The 'h40' job had part-* files of 1.5MB each and this job 7.2MB 
> each, so piping the data through `cat` produced even more duplicates, even 
> though `cat` is not supposed to change the results at all.
> I also ran a 'title hash.r2' job under pig-svn-trunk that regenerated the 
> hashes into another directory, just to make sure it wasn't a fluke run that 
> produced the duplicate hashes.  The second time around, it also produced 
> duplicates.  Running the dupe-detection pig job under pig-0.1.1 against the 
> hashes produced by pig-svn-trunk's second run, I again got 1.5MB output 
> files.
> For a final test, I ran the dupe-detection code in pig-svn-trunk on hash data 
> generated by pig-0.1.1:
> set job.name 'h40.trk.nh';
> hash = load '/home/danl/url_title/title_hash.011';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.trk.nh';
> The seven part-* files in 
> /home/danl/url_title/title_hash_collisions/h40.trk.nh are 0 bytes.  It's 
> clear that it's the stream operation running in pig-svn-trunk which is 
> producing the duplicates.
> Here is the complete svn info of the checkout I built pig from:
> Path: .
> URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 737873
> Node Kind: directory
> Schedule: normal
> Last Changed Author: pradeepkth
> Last Changed Rev: 737863
> Last Changed Date: 2009-01-26 13:27:16 -0800 (Mon, 26 Jan 2009)
> When I built it, I also ran all the unit tests.
> This was all run on Hadoop 0.18.2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
