RE: Hadoop-streaming using binary executable c program

Daniel Yehdego Mon, 01 Aug 2011 15:13:41 -0700

Hi Bobby, 

I have written a small Perl script which does the following job:


Assume we have an output from the mapper

MAP1
<RNA-1>
<STRUCTURE-1>

MAP2
<RNA-2>
<STRUCTURE-2>

MAP3
<RNA-3>
<STRUCTURE-3>

What the script should do is reduce this in the following manner:
<RNA-1><RNA-2><RNA-3>\t<STRUCTURE-1><STRUCTURE-2><STRUCTURE-3>\n
The script looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my @handles = map { open my $h, '<', $_; $h } @ARGV;

while (@handles){
    @handles = grep { ! eof $_ } @handles;
    my @lines = map { my $v = <$_>; chomp $v; $v } @handles;
    print join(' ', @lines), "\n";
}

close $_ for @handles;

This should work for any input from the mapper. But when I ran Hadoop 
streaming with the above code as my reducer, the job was successful
but the output files were empty, and I couldn't find out why.

 bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar 
-mapper ./hadoopPknotsRG 
-file /data/yehdego/hadoop-0.20.2/pknotsRG 
-file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG 
-reducer ./reducer.pl 
-file /data/yehdego/hadoop-0.20.2/reducer.pl  
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt 
-output /user/yehdego/RFR2-out -verbose

Any help or suggestion is really appreciated....I am just stuck here for the 
weekend.
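
One likely cause, offered as a sketch rather than a verified fix: Hadoop 
streaming hands the reducer its records on standard input and passes no 
file names on the command line. In the script above @ARGV is therefore 
empty, @handles starts out empty, and the while loop never runs, which 
would explain the empty output. The shell function below reads 
tab-separated key/value lines from stdin and joins all keys and all values 
into one record. The name join_records is hypothetical, and the function 
assumes the mapper really emits <RNA>\t<STRUCTURE> lines, which the raw 
pknotsRG output may not.

```sh
#!/bin/sh
# join_records (hypothetical name): read "key<TAB>value" lines from stdin
# and print one line: all keys concatenated, a tab, all values concatenated.
# Records are joined in the order they arrive, so restoring the original
# order after the shuffle is a separate problem.
join_records() {
    awk -F '\t' '
        { keys = keys $1; values = values $2 }
        END { if (NR > 0) printf "%s\t%s\n", keys, values }
    '
}
```

To use it as a streaming reducer, the script would end with a bare call to 
join_records so that it consumes whatever Hadoop pipes in.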
 
Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
[email protected]

> From: [email protected]
> To: [email protected]
> Date: Thu, 28 Jul 2011 07:12:11 -0700
> Subject: Re: Hadoop-streaming using binary executable c program
> 
> I am not completely sure what you are getting at.  It looks like the output 
> of your C program is as follows (and this is just a guess).  NOTE: \t stands 
> for the tab character, which streaming uses to separate the key from the 
> value; \n stands for the newline character, which separates individual records.
> <RNA-1>\t<STRUCTURE-1>\n
> <RNA-2>\t<STRUCTURE-2>\n
> <RNA-3>\t<STRUCTURE-3>\n
> ...
> 
> 
> And you want the output to look like
> <RNA-1><RNA-2><RNA-3>\t<STRUCTURE-1><STRUCTURE-2><STRUCTURE-3>\n
> 
> You could use a reduce to do this, but the issue here is with the shuffle in 
> between the maps and the reduces.  The Shuffle will group by the key to send 
> to the reducers and then sort by the key.  So in reality your map output 
> looks something like
> 
> FROM MAP 1:
> <RNA-1>\t<STRUCTURE-1>\n
> <RNA-2>\t<STRUCTURE-2>\n
> 
> FROM MAP 2:
> <RNA-3>\t<STRUCTURE-3>\n
> <RNA-4>\t<STRUCTURE-4>\n
> 
> FROM MAP 3:
> <RNA-5>\t<STRUCTURE-5>\n
> <RNA-6>\t<STRUCTURE-6>\n
> 
> If you send it to a single reducer (the only way to get a single file), then 
> the input to the reducer will be sorted alphabetically by the RNA, and the 
> order of the input will be lost.  You can work around this by giving each 
> line a unique number that reflects the order in which you want it to be 
> output.  But doing this would require you to write some code.  I would 
> suggest that you instead splice the map outputs together with a small shell 
> script after all the maps have completed.
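
The numbering workaround described above could be sketched as follows. Both 
function names are hypothetical, and the sketch assumes all records pass 
through a single numbering step; with several mappers, each would need its 
own non-overlapping number range, which is not shown here.

```sh
#!/bin/sh
# number_records (hypothetical): prefix each stdin line with a zero-padded
# sequence number and a tab, so the number becomes the streaming key.
# Zero padding keeps textual sort order equal to numeric order.
number_records() {
    awk '{ printf "%010d\t%s\n", NR, $0 }'
}

# restore_order (hypothetical): sort by the numeric key, then drop the key.
restore_order() {
    sort -t "$(printf '\t')" -k1,1 | cut -f2-
}
```

Running number_records before the job and restore_order on the collected 
output brings the lines back into their original order, whatever the 
shuffle did in between.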
> 
> --
> Bobby
> 
> On 7/27/11 2:55 PM, "Daniel Yehdego" <[email protected]> wrote:
> 
> 
> 
> Hi Bobby,
> 
> I just want to ask you if there is a way of using a reducer or something like 
> concatenation to glue together my outputs from the mapper and output
> them as a single file and a single segment of the predicted RNA 2D structure?
> 
> FYI: I have used -reducer NONE before:
> 
> HADOOP_HOME$ bin/hadoop jar
> /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper
> ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file
> /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input
> /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output
> /user/yehdego/RF-out -reducer NONE -verbose
> 
> and a sample of my mapper output from two different slave nodes looks 
> like this:
> 
> AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAACCCCAAAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
>     and
> [[[[[..................((((.(((((((...............))))))).))))............{{{{....]]]]].....}}}}....
>   (-13.46)
> 
> GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUUUUUCU
> ((((.(((((....((.((((((.......))))))))...))))).)))).  (-11.00)
> 
> and I want to concatenate and output them as a single predicted RNA sequence 
> structure:
> 
> AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAACCCCAAAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUUUUUCU
> 
> [[[[[..................((((.(((((((...............))))))).))))............{{{{....]]]]].....}}}}....((((.(((((....((.((((((.......))))))))...))))).)))).
> 
> 
> Regards,
> 
> Daniel T. Yehdego
> Computational Science Program
> University of Texas at El Paso, UTEP
> [email protected]
> 
> > From: [email protected]
> > To: [email protected]
> > Subject: RE: Hadoop-streaming using binary executable c program
> > Date: Tue, 26 Jul 2011 16:23:10 +0000
> >
> >
> > Good afternoon Bobby,
> >
> > Thanks so much, now it's working excellently. And the speed is also 
> > reasonable. Once again, thank you.
> >
> > Regards,
> >
> > Daniel T. Yehdego
> > Computational Science Program
> > University of Texas at El Paso, UTEP
> > [email protected]
> >
> > > From: [email protected]
> > > To: [email protected]
> > > Date: Mon, 25 Jul 2011 14:47:34 -0700
> > > Subject: Re: Hadoop-streaming using binary executable c program
> > >
> > > This is likely to be slow and it is not ideal.  The ideal would be to 
> > > modify pknotsRG to be able to read from stdin, but that may not be 
> > > possible.
> > >
> > > The shell script would probably look something like the following
> > >
> > > #!/bin/sh
> > > # Copy everything on stdin to a temporary file, then run pknotsRG on it.
> > > rm -f temp.txt
> > > while IFS= read -r line
> > > do
> > >   printf '%s\n' "$line" >> temp.txt
> > > done
> > > exec pknotsRG temp.txt
> > >
> > > Place it in a file, say hadoopPknotsRG.  Then you probably want to run
> > >
> > > chmod +x hadoopPknotsRG
> > >
> > > After that you want to test it with
> > >
> > > hadoop fs -cat 
> > > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head 
> > > -2 | ./hadoopPknotsRG
> > >
> > > If that works then you can try it with Hadoop streaming
> > >
> > > HADOOP_HOME$ bin/hadoop jar 
> > > /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper 
> > > ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file 
> > > /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input 
> > > /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output 
> > > /user/yehdego/RF-out -reducer NONE -verbose
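
A slightly more defensive variant of the wrapper idea, sketched under 
assumptions: it copies stdin to a unique temporary file created with mktemp 
(widely available, though not part of POSIX), runs a command on that file, 
and removes the file afterwards. The function name run_on_stdin_file is 
hypothetical, and the assumption that pknotsRG takes the file name as its 
only argument comes from the script above and has not been verified.

```sh
#!/bin/sh
# run_on_stdin_file (hypothetical): save stdin to a temp file, then run
# the given command with the temp file's path appended as last argument.
# The command's exit status is passed back to the caller.
run_on_stdin_file() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    "$@" "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

As the last line of a mapper script this would be called as 
run_on_stdin_file ./pknotsRG, which avoids a fixed temp.txt name at the 
cost of not using exec.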
> > >
> > > --Bobby
> > >
> > > On 7/25/11 3:37 PM, "Daniel Yehdego" <[email protected]> wrote:
> > >
> > >
> > >
> > > Good afternoon Bobby,
> > >
> > > Thanks, you were a great help in finding out what the problem was. 
> > > After I ran the command line you suggested, I found out that there was 
> > > a segmentation fault.
> > > The binary executable program pknotsRG only reads a file with a sequence 
> > > in it. This means there should be a shell script, as you said, that 
> > > will take the data coming
> > > from stdin and write it to a temporary file. Any idea how to do this 
> > > in a shell script? The thing is, I am from a biology background and 
> > > don't have much experience in CS.
> > > Looking forward to hearing from you. Thanks so much.
> > >
> > > Regards,
> > >
> > > Daniel T. Yehdego
> > > Computational Science Program
> > > University of Texas at El Paso, UTEP
> > > [email protected]
> > >
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Date: Fri, 22 Jul 2011 12:39:08 -0700
> > > > Subject: Re: Hadoop-streaming using binary executable c program
> > > >
> > > > I would suggest that you do the following to help you debug.
> > > >
> > > > hadoop fs -cat 
> > > > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | 
> > > > head -2 | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
> > > >
> > > > This is simulating what hadoop streaming is doing.  Here we are taking 
> > > > the first 2 lines out of the input file and feeding them to the stdin 
> > > > of pknotsRG.  The first step is to make sure that you can get your 
> > > > program to run correctly with something like this.  You may need to 
> > > > change the command line to pknotsRG to get it to read the data it is 
> > > > processing from stdin, instead of from a file.  Alternatively you may 
> > > > need to write a shell script that will take the data coming from stdin, 
> > > > write it to a file, and then call pknotsRG on that temporary file.  
> > > > Once you have this working then you should try it again with streaming.
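
The same local check can be extended to a whole job. As a rough sketch, a 
streaming job with one reducer behaves much like the pipeline 
mapper | sort | reducer, so a hypothetical helper such as 
simulate_streaming below can reproduce empty-output or ordering problems 
without a cluster. It ignores input splits, partitioning across reducers, 
and Hadoop's exact key comparison.

```sh
#!/bin/sh
# simulate_streaming (hypothetical): feed stdin through a mapper command,
# sort the mapper output as a stand-in for the shuffle, then run a reducer.
# Usage: simulate_streaming 'mapper command' 'reducer command' < input
simulate_streaming() {
    mapper=$1
    reducer=$2
    sh -c "$mapper" | LC_ALL=C sort | sh -c "$reducer"
}
```

Passing cat as the reducer shows exactly what a real reducer would receive 
on its standard input.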
> > > >
> > > > --Bobby Evans
> > > >
> > > > On 7/22/11 12:31 PM, "Daniel Yehdego" <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > Hi Bobby, Thanks for the response.
> > > >
> > > > After I tried the following command:
> > > >
> > > > bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -mapper 
> > > > /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -  -file 
> > > > /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG  -reducer NONE 
> > > > -input 
> > > > /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt 
> > > > -output /user/yehdego/RF-out -verbose
> > > >
> > > > I got the following stderr logs:
> > > >
> > > > java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
> > > > failed with code 139
> > > >         at 
> > > > org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> > > >         at 
> > > > org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> > > >         at 
> > > > org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> > > >         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> > > >         at 
> > > > org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> > > >         at 
> > > > org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > > >         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > >
> > > >
> > > >
> > > > syslog logs
> > > >
> > > > 2011-07-22 13:02:27,467 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
> > > > Initializing JVM Metrics with processName=MAP, sessionId=
> > > > 2011-07-22 13:02:27,913 INFO org.apache.hadoop.mapred.MapTask: 
> > > > numReduceTasks: 0
> > > > 2011-07-22 13:02:28,149 INFO org.apache.hadoop.streaming.PipeMapRed: 
> > > > PipeMapRed exec 
> > > > [/data/yehdego/hadoop_tmp/dfs/local/taskTracker/jobcache/job_201107181535_0079/attempt_201107181535_0079_m_000000_0/work/./pknotsRG]
> > > > 2011-07-22 13:02:28,242 INFO org.apache.hadoop.streaming.PipeMapRed: 
> > > > R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
> > > > 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> > > > MROutputThread done
> > > > 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> > > > MRErrorThread done
> > > > 2011-07-22 13:02:28,267 INFO org.apache.hadoop.streaming.PipeMapRed: 
> > > > PipeMapRed failed!
> > > > 2011-07-22 13:02:28,361 WARN org.apache.hadoop.mapred.TaskTracker: 
> > > > Error running child
> > > > java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
> > > > failed with code 139
> > > >         at 
> > > > org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> > > >         at 
> > > > org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> > > >         at 
> > > > org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> > > >         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> > > >         at 
> > > > org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> > > >         at 
> > > > org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > > >         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > > 2011-07-22 13:02:28,395 INFO org.apache.hadoop.mapred.TaskRunner: 
> > > > Runnning cleanup for the task
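
A note on the number in the trace: by the usual Unix convention, an exit 
status above 128 means the process was killed by a signal whose number is 
the status minus 128, so 139 corresponds to signal 11, which is SIGSEGV on 
Linux. The hypothetical helper below does that arithmetic; it assumes 
Hadoop reports the subprocess status using this convention.

```sh
#!/bin/sh
# describe_exit (hypothetical): explain a subprocess exit status.
# Statuses above 128 are read as "killed by signal (status - 128)".
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    elif [ "$status" -eq 0 ]; then
        echo "success"
    else
        echo "exited with error code $status"
    fi
}
```

Here describe_exit 139 prints "killed by signal 11", whereas a status of 1 
is an ordinary error exit chosen by the program itself.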
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Daniel T. Yehdego
> > > > Computational Science Program
> > > > University of Texas at El Paso, UTEP
> > > > [email protected]
> > > >
> > > > > From: [email protected]
> > > > > To: [email protected]; [email protected]
> > > > > Date: Fri, 22 Jul 2011 09:12:18 -0700
> > > > > Subject: Re: Hadoop-streaming using binary executable c program
> > > > >
> > > > > It looks like it tried to run your program and the program exited 
> > > > > with a 1, not a 0.  What are the stderr logs like for the mappers that 
> > > > > were launched?  You should be able to access them through the Web GUI. 
> > > > > You might also want to add some stderr log messages to your C program, 
> > > > > so that you can see how far along it gets before exiting.
> > > > >
> > > > > --Bobby Evans
> > > > >
> > > > > On 7/22/11 9:19 AM, "Daniel Yehdego" <[email protected]> 
> > > > > wrote:
> > > > >
> > > > > > I am trying to parallelize some very long RNA sequences for the sake of
> > > > > > predicting their RNA 2D structures. I am using a binary executable C
> > > > > > program called pknotsRG as my mapper. I tried the following bin/hadoop
> > > > > > command:
> > > > >
> > > > > HADOOP_HOME$ bin/hadoop
> > > > > jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar
> > > > > -mapper /data/yehdego/hadoop-0.20.2/pknotsRG
> > > > > -file /data/yehdego/hadoop-0.20.2/pknotsRG
> > > > > -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt
> > > > > -output /user/yehdego/RF-out -reducer NONE -verbose
> > > > >
> > > > > > but I keep getting the following error message:
> > > > >
> > > > > java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
> > > > > failed with code 1
> > > > >         at
> > > > > org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
> > > > >         at
> > > > > org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
> > > > >         at 
> > > > > org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
> > > > >         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
> > > > >         at 
> > > > > org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
> > > > >         at 
> > > > > org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > > > >         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > > >
> > > > > > FYI: my input file is RF00028_B.bpseqL3G5_seg_Centered_Method.txt,
> > > > > > which is a chunk of RNA sequences. The mapper is expected to read the
> > > > > > input file line by line and output the predicted structure for each
> > > > > > line of sequence, for a specified number of maps. Any help on this
> > > > > > problem is really appreciated. Thanks.
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> 
>
