[
https://issues.apache.org/jira/browse/PIG-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pradeep Kamath resolved PIG-672.
--------------------------------
Resolution: Invalid
Closing the issue based on the last comment. Please reopen if there is still an
issue.
> bad data output from STREAM operator in trunk (regression from 0.1.1)
> ---------------------------------------------------------------------
>
> Key: PIG-672
> URL: https://issues.apache.org/jira/browse/PIG-672
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Environment: Red Hat Enterprise Linux 4 & 5
> Hadoop 0.18.2
> Reporter: Daniel Lescohier
> Priority: Critical
>
> In the 0.1.1 release of pig, all of the following work fine; the problem is
> in the trunk version. Here's a brief intro to the workflow (details below):
> * I have 174856784 lines of input data, each line a unique title string.
> * I stream the data through `sha1.py`, which outputs a sha1 hash of each
> input line: a string of 40 hexadecimal digits.
> * I group on the hash, generating a count of each group, then filter on rows
> having a count > 1.
> * With pig 0.1.1, all the output part-* files are 0 bytes, because all the
> hashes are unique.
> * I've also verified entirely outside of Hadoop, using sort and uniq, that
> the hashes are unique (a rough equivalent of that check is sketched after
> this list).
> * A pig trunk checkout with "last changed rev 737863" returns non-empty
> results; the seven part-* files are 1.5MB each.
> * I've tracked the problem down to the STREAM operation (details below).
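> For reference, here is a rough Python equivalent of that sort-and-uniq check
> (an illustrative sketch only; the real check used the command-line tools,
> since holding all 174 million hashes in a Python set would take far more
> memory):
> from sys import stdin, stdout
> # Count lines on stdin whose hash was already seen earlier, e.g.:
> #   cat part-* | python check_dupes.py    (check_dupes.py is a hypothetical name)
> seen = set()
> dupes = 0
> for line in stdin:
>     h = line.rstrip("\n")
>     if h in seen:
>         dupes += 1
>     else:
>         seen.add(h)
> stdout.write("lines repeating an earlier hash: %d\n" % dupes)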
> Here's the pig-svn-trunk job that produces the hashes:
> set job.name 'title hash';
> DEFINE Cmd `sha1.py` ship('sha1.py');
> row = load '/home/danl/url_title/unique_titles';
> hashes = stream row through Cmd;
> store hashes into '/home/danl/url_title/title_hash';
> Here's the pig-0.1.1 job that produces the hashes:
> set job.name 'title hash 011';
> DEFINE Cmd `sha1.py` ship('sha1.py');
> row = load '/home/danl/url_title/unique_titles';
> hashes = stream row through Cmd;
> store hashes into '/home/danl/url_title/title_hash.011';
> Here's sha1.py:
> #!/opt/cnet-python/default-2.5/bin/python
> from sys import stdin, stdout
> from hashlib import sha1
> for line in stdin:
>     h = sha1()
>     h.update(line[:-1])
>     stdout.write("%s\n" % h.hexdigest())
> Here's the pig-svn-trunk job for finding duplicate hashes from the hashes
> data generated by pig-svn-trunk:
> set job.name 'h40';
> hash = load '/home/danl/url_title/title_hash';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40 are
> 1.5MB each.
> Here's the pig-0.1.1 job for finding duplicate hashes from the hashes data
> generated by pig-0.1.1:
> set job.name 'h40.011.nh';
> hash = load '/home/danl/url_title/title_hash.011';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.011.nh';
> The seven part-* files in
> /home/danl/url_title/title_hash_collisions/h40.011.nh are 0KB each.
> Here's the pig-0.1.1 job for finding duplicate hashes from the hashes data
> generated by pig-svn-trunk:
> set job.name 'h40.011';
> hash = load '/home/danl/url_title/title_hash';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.011';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.011
> are 1.5MB each.
> Therefore, it's the hash data generated by pig-svn-trunk
> (/home/danl/url_title/title_hash) which has duplicates in it.
> Here are the first six lines of /home/danl/url_title/title_hash/part-00064.
> You can see that lines five and six are duplicates. It looks as if the stream
> operator read the same line twice from the Python program. The job that
> produces the hashes is a map-only job with no reduces.
> 8f3513136b1c8b87b8b73b9d39d96555095e9cdd
> 2edb20c5a3862cc5f545ae649f1e26430a38bda4
> ca9c216629fce16b4c113c0d9fcf65f906ab5e04
> 03fe80633822215a6935bcf95305bb14adf23f18
> 03fe80633822215a6935bcf95305bb14adf23f18
> 6d324b2cd1c52f564e2a29fcbf6ae2fb83d2697c
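> To flag adjacent repeats like the pair above, a small script along these
> lines can be fed the contents of a part file (an illustrative sketch; the
> script name and the way the file is piped in are placeholders):
> from sys import stdin, stdout
> # Report any line identical to the line immediately before it, e.g.:
> #   hadoop fs -cat /home/danl/url_title/title_hash/part-00064 | python adjacent_dupes.py
> prev = None
> lineno = 0
> for line in stdin:
>     lineno += 1
>     if line == prev:
>         # 'line' already ends with a newline, so no extra "\n" is added
>         stdout.write("line %d repeats line %d: %s" % (lineno, lineno - 1, line))
>     prev = line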
> After narrowing it down to the stream operator in pig-svn-trunk, I decided to
> run the duplicate-finding job again using pig-svn-trunk, but this time piping
> the data through `cat` first. `cat` shouldn't change the data at all; it's an
> identity operation. Here's the job:
> set job.name 'h40.cat';
> DEFINE Cmd `cat`;
> row = load '/home/danl/url_title/title_hash';
> hash = stream row through Cmd;
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.cat';
> The seven part-* files in /home/danl/url_title/title_hash_collisions/h40.cat
> are 7.2MB each. This 'h40.cat' job should produce the same results as the
> 'h40' job, whose part-* files were 1.5MB each. Here they are 7.2MB each, so
> piping the data through `cat` produced even more duplicates, even though
> `cat` is not supposed to change the results at all.
> I also ran, under pig-svn-trunk, a 'title hash.r2' job that regenerated the
> hashes into another directory, just to make sure the duplicate hashes weren't
> a fluke of a single run. The second run also produced duplicates. Running the
> dupe-detection pig job under pig-0.1.1 on the hashes produced by pig-svn-trunk
> in that second run, I again got 1.5MB output files.
> For a final test, I ran the dupe-detection job under pig-svn-trunk on the
> hash data generated by pig-0.1.1:
> set job.name 'h40.trk.nh';
> hash = load '/home/danl/url_title/title_hash.011';
> grouped = group hash by $0 parallel 7;
> counted = foreach grouped generate group, COUNT(hash) as cnt;
> having = filter counted by cnt > 1;
> store having into '/home/danl/url_title/title_hash_collisions/h40.trk.nh';
> The seven part-* files in
> /home/danl/url_title/title_hash_collisions/h40.trk.nh are 0 bytes. It's
> clear that the STREAM operation running in pig-svn-trunk is what's producing
> the duplicates.
> Here is the complete svn info of the checkout I built pig from:
> Path: .
> URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 737873
> Node Kind: directory
> Schedule: normal
> Last Changed Author: pradeepkth
> Last Changed Rev: 737863
> Last Changed Date: 2009-01-26 13:27:16 -0800 (Mon, 26 Jan 2009)
> When I built it, I also ran all the unit tests.
> This was all run on Hadoop 0.18.2.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.