The problem is solved. I had to make sure that the streaming file is given
via "-input" and the other file via "-file". That fixed it.

Thanks,
PD

On Fri, Aug 31, 2012 at 10:07 AM, Periya.Data <periya.d...@gmail.com> wrote:

> Yes, both input files need to be processed by the mapper, but not in the
> same fashion. Essentially, this is what my Python script does:
> - read two text files, A and B. File A has a list of account-IDs (all
> numeric). File B has about 10 records, some of which have the same
> account_ID as those listed in file A.
> - mapper: reads both files, compares them, and prints out the records that
> have matching account_IDs.
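>
> (For reference, a minimal sketch of what such a mapper can look like under
> Hadoop Streaming, assuming file A is shipped to the task with "-file" and
> its name is passed as the script's only argument, while file B's records
> arrive on stdin via "-input"; the field layout below is only an assumption:)
>
> #!/usr/bin/env python
> # Sketch only - not the actual test2.py.
> # argv[1]: name of the account-ID file shipped alongside the job via -file.
> # stdin:   records of the other file, streamed in through -input.
> import sys
>
> def main():
>     with open(sys.argv[1]) as f:
>         ids = set(line.strip() for line in f if line.strip())
>     for record in sys.stdin:
>         # assume the account ID is the first comma-separated field
>         account_id = record.split(',')[0].strip()
>         if account_id in ids:
>             sys.stdout.write(record)
>
> if __name__ == '__main__':
>     main()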
>
> I have tried placing both the input files under a single input directory.
> Same behavior.
>
> And, from what I have read so far, "-mapper" (or "-reducer") should contain
> ONLY the name of the executable (in my case, "test2.py"). But if I do that,
> nothing happens. I have to explicitly write something like:
> -mapper "cat $1 | python $GHU_HOME/test2.py $2"
> ...which looks unconventional...but it produces "some" output, though not
> the correct one.
>
> Again, if I run my script on a plain Linux machine, using the basic
> command
> cat $1 | python test2.py $2,
> it produces the expected output.
>
>
> *Observation*: If I do not specify the two files with the "-file" option,
> then I see no output written to HDFS, even though the output directory has
> empty part-files and a _SUCCESS marker. The three part-files are reasonable,
> as three mappers are configured for each job.
>
>
> My current command:
>
> hadoop jar ...streaming.jar
>          -input /user/ghu/input/* \
>          -output /user/ghu/out \
>          -file /home/ghu/test2.py \
>          -mapper "cat $1 | python test2.py $2" \
>          -file /home/ghu/$1 \
>          -file /home/ghu/$2
>
>
> Learning,
> /PD
>
> On Thu, Aug 30, 2012 at 9:46 PM, Hemanth Yamijala <yhema...@gmail.com> wrote:
>
>> Hi,
>>
>> Do both input files contain data that needs to be processed by the
>> mapper in the same fashion? In which case, you could just put the
>> input files under a directory in HDFS and provide that as input. The
>> -input option does accept a directory as an argument.
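>>
>> For example (just a sketch, with an illustrative path):
>>   -input /user/ghu/input \
>> would pick up every file under that directory as the job's input.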
>>
>> Otherwise, can you please explain a little more about what you're trying
>> to do with the two inputs?
>>
>> Thanks
>> Hemanth
>>
>> On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data <periya.d...@gmail.com>
>> wrote:
>> > This is interesting. I changed my command to:
>> >
>> > -mapper "cat $1 |  $GHU_HOME/test2.py $2" \
>> >
>> > and it is producing output to HDFS. But the output is not what I expected,
>> > and it is not the same as when I do "cat | map" on Linux. It is producing
>> > part-00000, part-00001 and part-00002. I expected only one output file
>> > with just 2 records.
>> >
>> > I think I have to understand what exactly "-file" does and what exactly
>> > "-input" does. I am experimenting with what happens if I give my input
>> > files on the command line (like: test2.py arg1 arg2) as opposed to
>> > specifying the input files via the "-file" and "-input" options...
>> >
>> > The problem is that I have 2 input files...and have no idea how to pass
>> > them. Should I keep one in HDFS and stream in the other?
>> >
>> > More digging,
>> > PD/
>> >
>> >
>> >
>> > On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data <periya.d...@gmail.com>
>> wrote:
>> >
>> >> Hi Bertrand,
>> >>     No, I do not observe the same behavior when I run using cat | map. I
>> >> can see the output on STDOUT when I run my program.
>> >>
>> >> I do not have any reducer. In my command, I provide
>> >> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
>> >> written directly to HDFS.
>> >>
>> >> Your suspicion may be right about the output. In my counters, the "map
>> >> input records" = 40 and "map output records" = 0. I am trying to see if
>> >> I am messing up in my command...(see below)
>> >>
>> >> Initially, I had my mapper "test2.py" take in 2 arguments. Now, I am
>> >> streaming one file in, and test2.py takes in only one argument. How
>> >> should I frame my command below? I think that is where I am messing up.
>> >>
>> >>
>> >> run.sh:        (run as:   cat <arg2> | ./run.sh <arg1> )
>> >> -----------
>> >>
>> >> hadoop jar
>> >> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>> >>         -D mapred.reduce.tasks=0 \
>> >>         -verbose \
>> >>         -input "$HDFS_INPUT" \
>> >>         -input "$HDFS_INPUT_2" \
>> >>         -output "$HDFS_OUTPUT" \
>> >>         -file   "$GHU_HOME/test2.py" \
>> >>         -mapper "python $GHU_HOME/test2.py $1" \
>> >>         -file   "$GHU_HOME/$1"
>> >>
>> >>
>> >>
>> >> If I modify my mapper to take in 2 arguments, then, I would run it as:
>> >>
>> >> run.sh:        (run as:   ./run.sh <arg1>  <arg2>)
>> >> -----------
>> >>
>> >> hadoop jar
>> >> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>> >>         -D mapred.reduce.tasks=0 \
>> >>         -verbose \
>> >>         -input "$HDFS_INPUT" \
>> >>         -input "$HDFS_INPUT_2" \
>> >>         -output "$HDFS_OUTPUT" \
>> >>         -file   "$GHU_HOME/test2.py" \
>> >>         -mapper "python $GHU_HOME/test2.py $1 $2" \
>> >>         -file   "$GHU_HOME/$1" \
>> >>         -file   "GHU_HOME/$2"
>> >>
>> >>
>> >> Please let me know if I am making a mistake here.
>> >>
>> >>
>> >> Thanks.
>> >> PD
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <decho...@gmail.com>
>> >> wrote:
>> >>
>> >>> Do you observe the same thing when running without Hadoop? (cat, map,
>> >>> sort and then reduce)
>> >>>
>> >>> Could you provide the counters of your job? You should be able to get
>> >>> them using the job tracker interface.
>> >>>
>> >>> The most probable answer, without more information, would be that your
>> >>> reducer does not output any <key,value>s.
>> >>>
>> >>> Regards
>> >>>
>> >>> Bertrand
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <periya.d...@gmail.com>
>> >>> wrote:
>> >>>
>> >>> > Hi All,
>> >>> >    My Hadoop streaming job (in Python) runs to "completion" (both map
>> >>> > and reduce say 100% complete). But, when I look at the output
>> >>> > directory in HDFS, the part files are empty. I do not know what might
>> >>> > be causing this behavior. I understand that the percentages represent
>> >>> > the records that have been read in (not processed).
>> >>> >
>> >>> > The following are some of the logs. The detailed logs from Cloudera
>> >>> > Manager say that there were no Map Outputs...which is interesting.
>> >>> > Any suggestions?
>> >>> >
>> >>> >
>> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
>> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=xxxxx.yyy.com:8021 -kill job_201208232245_3182
>> >>> > 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL: http://xxxxxx.yyyy.com:60030/jobdetails.jsp?jobid=job_201208232245_3182
>> >>> > 12/08/30 03:27:15 INFO streaming.StreamJob:  map 0%  reduce 0%
>> >>> > 12/08/30 03:27:20 INFO streaming.StreamJob:  map 33%  reduce 0%
>> >>> > 12/08/30 03:27:23 INFO streaming.StreamJob:  map 67%  reduce 0%
>> >>> > 12/08/30 03:27:29 INFO streaming.StreamJob:  map 100%  reduce 0%
>> >>> > 12/08/30 03:27:33 INFO streaming.StreamJob:  map 100%  reduce 100%
>> >>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Job complete:
>> >>> > job_201208232245_3182
>> >>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
>> >>> > Thu Aug 30 03:27:24 GMT 2012
>> >>> > *** END
>> >>> > bash-3.2$
>> >>> > bash-3.2$ hadoop fs -ls /user/ghu/
>> >>> > Found 5 items
>> >>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/_SUCCESS
>> >>> > drwxrwxrwx   - ghu hadoop          0 2012-08-30 03:27 /user/GHU/_logs
>> >>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00000
>> >>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00001
>> >>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00002
>> >>> > bash-3.2$
>> >>> >
>> >>> >
>> >>> > --------------------------------------------------------------------------------------------------------------------
>> >>> >
>> >>> >
>> >>> > Metadata:
>> >>> >   Status: Succeeded
>> >>> >   Type: MapReduce
>> >>> >   Id: job_201208232245_3182
>> >>> >   Name: CaidMatch
>> >>> >   User: srisrini
>> >>> >   Mapper class: PipeMapper
>> >>> >   Reducer class:
>> >>> >   Scheduler pool name: default
>> >>> >   Job input directory: hdfs://xxxxx.yyy.txt,hdfs://xxxx.yyyy.com/user/GHUcaidlist.txt
>> >>> >   Job output directory: hdfs://xxxx.yyyy.com/user/GHU/
>> >>> > Timing:
>> >>> >   Duration: 20.977s
>> >>> >   Submit time: Wed, 29 Aug 2012 08:27 PM
>> >>> >   Start time: Wed, 29 Aug 2012 08:27 PM
>> >>> >   Finish time: Wed, 29 Aug 2012 08:27 PM
>> >>> > Progress and Scheduling:
>> >>> >   Map Progress: 100.0%
>> >>> >   Reduce Progress: 100.0%
>> >>> >   Launched maps: 4
>> >>> >   Data-local maps: 3
>> >>> >   Rack-local maps: 1
>> >>> >   Other local maps:
>> >>> >   Desired maps: 3
>> >>> >   Launched reducers:
>> >>> >   Desired reducers: 0
>> >>> >   Fairscheduler running tasks:
>> >>> >   Fairscheduler minimum share:
>> >>> >   Fairscheduler demand:
>> >>> > Current Resource Usage:
>> >>> >   Current User CPUs: 0
>> >>> >   Current System CPUs: 0
>> >>> >   Resident memory: 0 B
>> >>> >   Running maps: 0
>> >>> >   Running reducers: 0
>> >>> > Aggregate Resource Usage and Counters:
>> >>> >   User CPU: 0s
>> >>> >   System CPU: 0s
>> >>> >   Map Slot Time: 12.135s
>> >>> >   Reduce slot time: 0s
>> >>> >   Cumulative disk reads:
>> >>> >   Cumulative disk writes: 155.0 KiB
>> >>> >   Cumulative HDFS reads: 3.6 KiB
>> >>> >   Cumulative HDFS writes:
>> >>> >   Map input bytes: 2.5 KiB
>> >>> >   Map input records: 45
>> >>> >   Map output records: 0
>> >>> >   Reducer input groups:
>> >>> >   Reducer input records:
>> >>> >   Reducer output records:
>> >>> >   Reducer shuffle bytes:
>> >>> >   Spilled records:
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Bertrand Dechoux
>> >>>
>> >>
>> >>
>>
>
>
