I'm having a weird issue. When I invoke my mapreduce with a secondary sort using the KeyFieldBasedPartitioner, it's altering lines containing backslashes. Or I've made some foolish conceptual error and my script is doing so, but only when there's a partitioner. Any advice welcome. I've attached the script and a bowdlerized copy of the output, since I fear the worst for the formatting on the text below.
With no partitioner, among a few million other lines, my script produces this one correctly:

=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526 0000018421 M...e http://http:\\www.MyWebsitee.com S, NJ I... notice. Eastern Time (US & Canada) -18000 20081205-184526
=========

(was called using:)

hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input rawd/keyed/_20081205'/user-keyed.tsv' \
    -output out/"parsed-$output_id"

Note that the website field contained http://http:\\www.MyWebsitee.com (this person clearly either fails at internet or wins at windows).

When I use a KeyFieldBasedPartitioner, it behaves correctly *except* on these few lines with backslashes. Where the input has two backslashes, the output instead has a single backslash followed by a tab:

=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526 0000018421 M...e http://http:\ www.MyWebsitee.com S, NJ I... notice. Eastern Time (US & Canada) -18000 20081205-184526
=========

(was called using:)

hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf map.output.key.field.separator='\t' \
    -jobconf num.key.fields.for.partition=1 \
    -jobconf stream.map.output.field.separator='\t' \
    -jobconf stream.num.map.output.key.fields=2 \
    -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input rawd/keyed/_20081205'/user-keyed.tsv' \
    -output out/"parsed-$output_id"

When I run the script on the command line

cat input | hadoop_parse_json.rb | sort -k1,2 | hadoop_uniq_without_timestamp.rb

everything works as I'd like.

I've hunted through the JIRA and found nothing. If this sounds like a problem with hadoop I'll try to isolate a proper test case.

Thanks for any advice,
flip
The output of my script with no secondary sort produces, among a few million others, this line correctly:

=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526 0000018421 M...e http://http:\\www.MyWebsitee.com S, NJ I... notice. Eastern Time (US & Canada) -18000 20081205-184526
=========

When I use a KeyFieldBasedPartitioner, it reaches in and diddles lines with backslashes:

=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526 0000018421 M...e http://http:\ www.MyWebsitee.com S, NJ I... notice. Eastern Time (US & Canada) -18000 20081205-184526
=========

===========================================================================
==
== Script, with partitioner
==

#!/usr/bin/env bash
input_id=$1
output_id=$2
hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf map.output.key.field.separator='\t' \
    -jobconf num.key.fields.for.partition=1 \
    -jobconf stream.map.output.field.separator='\t' \
    -jobconf stream.num.map.output.key.fields=2 \
    -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input rawd/keyed/_20081205'/user-keyed.tsv' \
    -output out/"parsed-$output_id" \
    -file hadoop_utils.rb \
    -file twitter_flat_model.rb \
    -file twitter_autourl.rb

== Excerpt of output. Everything is correct except the url field

twitter_user_profile twitter_user_profile-0000018441-20081205-024904 0000018441 G..er http://www.l... D... O fun...:-) 20081205-024904
twitter_user_profile twitter_user_profile-0000018441-20081205-084448 0000018441 S...e Eastern Time (US & Canada) -18000 20081205-084448
twitter_user_profile twitter_user_profile-0000018421-20081205-184526 0000018421 M...e http://http:\ www.MyWebsitee.com S, NJ I... notice. Eastern Time (US & Canada) -18000 20081205-184526
twitter_user_profile twitter_user_profile-0000018481-20081205-030907 0000018481 J http://i....com D... T.... A 43200 20081205-030907
twitter_user_profile twitter_user_profile-0000018401-20081205-010944 0000018401 O London 0 20081205-010944

== Removing the partitioner...

#!/usr/bin/env bash
input_id=$1
output_id=$2
hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input rawd/keyed/_20081205'/user-keyed.tsv' \
    -output out/"parsed-$output_id" \
    -file hadoop_utils.rb \
    -file twitter_flat_model.rb \
    -file twitter_autourl.rb

== ... leaves output fields unmolested.

twitter_user_profile twitter_user_profile-0000059832-20081205-184727 0000059832 m...o S 28800 20081205-184727
twitter_user_profile twitter_user_profile-0000146069-20081205-184637 0000146069 M...d 20081205-184637
twitter_user_profile twitter_user_profile-0000000069-20081205-184525 0000000069 M.... http://www.m.....vox.com S C..... Eastern Time (US & Canada) -18000 20081205-184525
twitter_user_profile twitter_user_profile-0000167822-20081205-184710 0000167822 M...y 20081205-184710
twitter_user_profile twitter_user_profile-0000117502-20081205-184637 0000117502 M...g "" "" B 3600 20081205-184637
twitter_user_profile twitter_user_profile-0000018421-20081205-184526 0000018421 M...e http://http:\\www.MyWebsitee.com S, NJ I... notice. Eastern Time (US & Canada) -18000 20081205-184526
twitter_user_profile twitter_user_profile-0000147671-20081205-184455 0000147671 M....k http://www.C...U.com E, IL C....,. Central Time (US & Canada) -21600 20081205-184455
twitter_user_profile twitter_user_profile-0000161375-20081205-184637 0000161375 M....y Q -18000 20081205-184637
twitter_user_profile twitter_user_profile-0000142698-20081205-184527 0000142698 M....r 20081205-184527
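
== One thing that may be worth ruling out: what '\t' actually passes

In bash, single quotes do not expand escapes, so -jobconf stream.map.output.field.separator='\t' hands hadoop the two characters backslash and t, not a tab. The sketch below checks that, and then simulates one hypothesis only: that the separator ends up being treated as the single character backslash. How streaming really interprets this value is an assumption here, not something confirmed from the hadoop source. The variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumes bash: $'...' (ANSI-C quoting) and ${var:0:1} are not POSIX sh.

literal='\t'       # what the -jobconf ...separator='\t' options pass: backslash + t
real_tab=$'\t'     # an actual tab character

literal_len=${#literal}      # length in characters: 2
real_tab_len=${#real_tab}    # length in characters: 1

# HYPOTHESIS (unverified): only the first character of the separator is used.
sep=${literal:0:1}           # a single backslash

# The url field from the affected record (two literal backslashes).
url='http://http:\\www.MyWebsitee.com'

# Split after the 2nd separator, mimicking stream.num.map.output.key.fields=2,
# then rejoin key and value with a tab.
rest=${url#*"$sep"}          # text after the first backslash
value=${rest#*"$sep"}        # text after the second backslash
key=${url%"$sep$value"}      # everything before the second backslash
result="$key"$'\t'"$value"

printf 'literal separator is %d chars, real tab is %d char\n' "$literal_len" "$real_tab_len"
printf '%s\n' "$result"
```

Under that hypothesis the second backslash is consumed as the key/value boundary and a tab is written in its place, which is the same shape as the damaged field above. If that turns out to be the cause, passing $'\t' (a real tab), or dropping the separator options if tab is already the default, might be worth trying.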
