The problem is solved. I had to make sure that the file to be streamed is given via "-input" and the other (lookup) file is given via "-file". That solved the issue.
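For anyone who hits the same thing, roughly what the working invocation looks like (the file names below are just placeholders, not the exact ones from my job):

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
    -D mapred.reduce.tasks=0 \
    -input /user/ghu/input/records.txt \
    -output /user/ghu/out \
    -mapper "python test2.py account_ids.txt" \
    -file /home/ghu/test2.py \
    -file /home/ghu/account_ids.txt

The file named in -input is what gets split and streamed to the mapper on stdin, while anything passed with -file is shipped into each task's working directory, so the mapper can open account_ids.txt by its bare name.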
Thanks,
PD

On Fri, Aug 31, 2012 at 10:07 AM, Periya.Data <periya.d...@gmail.com> wrote:
> Yes, both input files need to be processed by the mapper, but not in the same fashion. Essentially, this is what my Python script does:
> - read two text files, A and B. File A has a list of account-IDs (all numeric). File B has about 10 records, some of which have the same account_ID as those listed in file A.
> - mapper: read both files, compare, and print out those records that have matching account_IDs.
>
> I have tried placing both the input files under a single input directory. Same behavior.
>
> And, from what I have read so far, "-mapper" or "-reducer" should have ONLY the name of the executable (in my case, "test2.py"). But if I do that, nothing happens. I have to explicitly write:
> -mapper "cat $1 | python $GHU_HOME/test2.py $2"
> ...or something like that, which looks unconventional, but it produces "some" output - not the correct one, though.
>
> Again, if I run my script on a plain Linux machine, using the basic command:
> cat $1 | python test2.py $2
> it produces the expected output.
>
> *Observation*: If I do not specify the two files under the "-file" option, then I see no output written to HDFS, even though the output directory has empty part-files and a _SUCCESS marker. The 3 part-files are reasonable, as 3 mappers are configured for each job.
>
> My current command:
>
> hadoop jar ...streaming.jar \
>   -input /user/ghu/input/* \
>   -output /user/ghu/out \
>   -file /home/ghu/test2.py \
>   -mapper "cat $1 | python test2.py $2" \
>   -file /home/ghu/$1 \
>   -file /home/ghu/$2
>
> Learning,
> /PD
>
> On Thu, Aug 30, 2012 at 9:46 PM, Hemanth Yamijala <yhema...@gmail.com> wrote:
>
>> Hi,
>>
>> Do both input files contain data that needs to be processed by the mapper in the same fashion? In which case, you could just put the input files under a directory in HDFS and provide that as input. The -input option does accept a directory as argument.
>>
>> Otherwise, can you please explain a little more what you're trying to do with the two inputs.
>>
>> Thanks
>> Hemanth
>>
>> On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data <periya.d...@gmail.com> wrote:
>> > This is interesting. I changed my command to:
>> >
>> > -mapper "cat $1 | $GHU_HOME/test2.py $2" \
>> >
>> > and it is producing output to HDFS. But the output is not what I expected and is not the same as when I do "cat | map" on Linux. It is producing part-00000, part-00001 and part-00002. I expected only one output file with just 2 records.
>> >
>> > I think I have to understand what exactly "-file" does and what exactly "-input" does. I am experimenting with what happens if I give my input files on the command line (like: test2.py arg1 arg2) as against specifying the input files via the "-file" and "-input" options...
>> >
>> > The problem is I have 2 input files... and have no idea how to pass them. Should I keep one in HDFS and stream in the other?
>> >
>> > More digging,
>> > PD/
>> >
>> > On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data <periya.d...@gmail.com> wrote:
>> >
>> >> Hi Bertrand,
>> >> No, I do not observe the same when I run using cat | map. I can see the output in STDOUT when I run my program.
>> >>
>> >> I do not have any reducer. In my command, I provide "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be written directly to HDFS.
>> >>
>> >> Your suspicion may be right... about the output.
>> >> In my counters, "map input records" = 40 and "map output records" = 0. I am trying to see if I am messing up in my command... (see below)
>> >>
>> >> Initially, I had my mapper "test2.py" take in 2 arguments. Now, I am streaming one file in and test2.py takes in only one argument. How should I frame my command below? I think that is where I am messing up..
>> >>
>> >> run.sh: (run as: cat <arg2> | ./run.sh <arg1>)
>> >> -----------
>> >> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>> >>   -D mapred.reduce.tasks=0 \
>> >>   -verbose \
>> >>   -input "$HDFS_INPUT" \
>> >>   -input "$HDFS_INPUT_2" \
>> >>   -output "$HDFS_OUTPUT" \
>> >>   -file "$GHU_HOME/test2.py" \
>> >>   -mapper "python $GHU_HOME/test2.py $1" \
>> >>   -file "$GHU_HOME/$1"
>> >>
>> >> If I modify my mapper to take in 2 arguments, then I would run it as:
>> >>
>> >> run.sh: (run as: ./run.sh <arg1> <arg2>)
>> >> -----------
>> >> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>> >>   -D mapred.reduce.tasks=0 \
>> >>   -verbose \
>> >>   -input "$HDFS_INPUT" \
>> >>   -input "$HDFS_INPUT_2" \
>> >>   -output "$HDFS_OUTPUT" \
>> >>   -file "$GHU_HOME/test2.py" \
>> >>   -mapper "python $GHU_HOME/test2.py $1 $2" \
>> >>   -file "$GHU_HOME/$1" \
>> >>   -file "$GHU_HOME/$2"
>> >>
>> >> Please let me know if I am making a mistake here.
>> >>
>> >> Thanks.
>> >> PD
>> >>
>> >> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>> >>
>> >>> Do you observe the same thing when running without Hadoop? (cat, map, sort and then reduce)
>> >>>
>> >>> Could you provide the counters of your job? You should be able to get them using the job tracker interface.
>> >>>
>> >>> The most probable answer without more information would be that your reducer does not output any <key,value>s.
>> >>>
>> >>> Regards
>> >>>
>> >>> Bertrand
>> >>>
>> >>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <periya.d...@gmail.com> wrote:
>> >>>
>> >>> > Hi All,
>> >>> > My Hadoop streaming job (in Python) runs to "completion" (both map and reduce say 100% complete). But when I look at the output directory in HDFS, the part files are empty. I do not know what might be causing this behavior. I understand that the percentages represent the records that have been read in (not processed).
>> >>> >
>> >>> > The following are some of the logs. The detailed logs from Cloudera Manager say that there were no Map Outputs... which is interesting. Any suggestions?
>> >>> >
>> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
>> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=xxxxx.yyy.com:8021 -kill job_201208232245_3182
>> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL: http://xxxxxx.yyyy.com:60030/jobdetails.jsp?jobid=job_201208232245_3182
>> >>> > 12/08/30 03:27:15 INFO streaming.StreamJob: map 0% reduce 0%
>> >>> > 12/08/30 03:27:20 INFO streaming.StreamJob: map 33% reduce 0%
>> >>> > 12/08/30 03:27:23 INFO streaming.StreamJob: map 67% reduce 0%
>> >>> > 12/08/30 03:27:29 INFO streaming.StreamJob: map 100% reduce 0%
>> >>> > 12/08/30 03:27:33 INFO streaming.StreamJob: map 100% reduce 100%
>> >>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Job complete: job_201208232245_3182
>> >>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
>> >>> > Thu Aug 30 03:27:24 GMT 2012
>> >>> > *** END
>> >>> > bash-3.2$
>> >>> > bash-3.2$ hadoop fs -ls /user/ghu/
>> >>> > Found 5 items
>> >>> > -rw-r--r--   3 ghu hadoop   0 2012-08-30 03:27 /user/GHU/_SUCCESS
>> >>> > drwxrwxrwx   - ghu hadoop   0 2012-08-30 03:27 /user/GHU/_logs
>> >>> > -rw-r--r--   3 ghu hadoop   0 2012-08-30 03:27 /user/GHU/part-00000
>> >>> > -rw-r--r--   3 ghu hadoop   0 2012-08-30 03:27 /user/GHU/part-00001
>> >>> > -rw-r--r--   3 ghu hadoop   0 2012-08-30 03:27 /user/GHU/part-00002
>> >>> > bash-3.2$
>> >>> >
>> >>> > --------------------------------------------------------------------------------------------------------------------
>> >>> >
>> >>> > Metadata:
>> >>> >   Status: Succeeded    Type: MapReduce    Id: job_201208232245_3182
>> >>> >   Name: CaidMatch    User: srisrini    Mapper class: PipeMapper    Reducer class:
>> >>> >   Scheduler pool name: default
>> >>> >   Job input directory: hdfs://xxxxx.yyy.txt,hdfs://xxxx.yyyy.com/user/GHUcaidlist.txt
>> >>> >   Job output directory: hdfs://xxxx.yyyy.com/user/GHU/
>> >>> > Timing:
>> >>> >   Duration: 20.977s    Submit time: Wed, 29 Aug 2012 08:27 PM
>> >>> >   Start time: Wed, 29 Aug 2012 08:27 PM    Finish time: Wed, 29 Aug 2012 08:27 PM
>> >>> > Progress and Scheduling:
>> >>> >   Map Progress: 100.0%    Reduce Progress: 100.0%
>> >>> >   Launched maps: 4    Data-local maps: 3    Rack-local maps: 1    Other local maps:
>> >>> >   Desired maps: 3    Launched reducers:    Desired reducers: 0
>> >>> >   Fairscheduler running tasks:    Fairscheduler minimum share:    Fairscheduler demand:
>> >>> > Current Resource Usage:
>> >>> >   Current User CPUs: 0    Current System CPUs: 0    Resident memory: 0 B
>> >>> >   Running maps: 0    Running reducers: 0
>> >>> > Aggregate Resource Usage and Counters:
>> >>> >   User CPU: 0s    System CPU: 0s    Map Slot Time: 12.135s    Reduce slot time: 0s
>> >>> >   Cumulative disk reads:    Cumulative disk writes: 155.0 KiB
>> >>> >   Cumulative HDFS reads: 3.6 KiB    Cumulative HDFS writes:
>> >>> >   Map input bytes: 2.5 KiB    Map input records: 45    Map output records: 0
>> >>> >   Reducer input groups:    Reducer input records:    Reducer output records:
>> >>> >   Reducer shuffle bytes:    Spilled records:
>> >>>
>> >>> --
>> >>> Bertrand Dechoux
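For reference, a minimal mapper along the lines described in this thread might look like the sketch below. It is only a sketch: the lookup file name (account_ids.txt) and the assumption that the account ID is the first tab-separated field of each record are placeholders, not details confirmed anywhere above.

#!/usr/bin/env python
# test2.py - sketch of a map-side lookup for Hadoop Streaming.
# Assumes the account-ID list was shipped with -file, so it is available
# in the task's working directory under its bare name.
import sys

def load_account_ids(path):
    # One account ID per line; ignore blank lines.
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def main():
    # The lookup file name is the mapper's only argument,
    # e.g. -mapper "python test2.py account_ids.txt" (placeholder name).
    wanted = load_account_ids(sys.argv[1])
    # Records from the -input file arrive on stdin; assume the account ID
    # is the first tab-separated field (an assumption about the layout).
    for line in sys.stdin:
        account_id = line.split('\t', 1)[0].strip()
        if account_id in wanted:
            sys.stdout.write(line)

if __name__ == '__main__':
    main()

With mapred.reduce.tasks=0 the matching lines are written straight to the part files, one per map task, which is consistent with the three part-files in the listing above.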