Re: Hadoop Streaming Semantics

2009-02-02 Thread Amareshwari Sriramadasu

S D wrote:

Thanks for your response. I'm using version 0.19.0 of Hadoop.
I tried your suggestion. Here is the command I use to invoke Hadoop:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
   -input /user/hadoop/hadoop-input/inputFile.txt \
   -output /user/hadoop/hadoop-output \
   -mapper map-script.sh \
   -file map-script.sh \
   -file additional-script.rb \
   -file utils.rb \
   -file env.sh \
   -file aws-s3-credentials-file \
   -jobconf mapred.reduce.tasks=0 \
   -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

(additional-script.rb is called by map-script.sh; aws-s3-credentials-file gives the
job permission to use AWS::S3.)

Everything works fine if the -inputformat switch is not included, but when I
include it I get the following message:
   ERROR streaming.StreamJob: Job not Successful!
and a runtime exception shows up in the JobTracker log:
   PipeMapRed.waitOutputThreads(): subprocess failed with code 1

My map functions read each line of the input file and create a directory
(one for each line) on Hadoop (in our case S3 Native) in which the
corresponding data is produced and stored. The names of the created
directories are based on the contents of the corresponding line. When I
include the -inputformat line above, I've noticed that instead of the
directories I'm expecting (named after the data found in the input file),
the directories are given seemingly arbitrary numeric names; e.g., when the
input file contained four lines of data, the directories were named 0, 273,
546 and 819.


LineRecordReader reads the line as the VALUE; the KEY is the byte offset of the
line in the file. It looks like your directories are getting named with the KEY;
the numbers you saw (0, 273, 546 and 819) would be the byte offsets of your four
lines. But I don't see any reason for that, because it works fine with
TextInputFormat, and both TextInputFormat and NLineInputFormat use LineRecordReader.
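If streaming really is handing your script the key as well as the value when a
non-default input format is used, one workaround is to drop the key inside the
mapper itself. Here is a minimal sketch (hypothetical; your map-script.sh will
differ, and the exact record format is worth confirming from the task logs):

#!/bin/bash
# Hypothetical guard for a streaming mapper: if records arrive as "key<TAB>value",
# keep only the value; lines without a tab pass through unchanged.
TAB=$'\t'
while IFS= read -r record; do
  line=${record#*"$TAB"}     # strip up to and including the first tab, if present
  printf '%s\n' "$line"      # hand the cleaned line to the real processing step
done

With something like that in place, the directory names should again come from the
line contents rather than from the byte offsets.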


-Amareshwari

Any thoughts?

John

On Sun, Feb 1, 2009 at 11:00 PM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:


Which version of hadoop are you using?

You can directly use -inputformat
org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. You
need not include it in your streaming jar.
-Amareshwari


S D wrote:



Thanks for your response Amareshwari. I'm unclear on how to take advantage
of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
streaming jar file (contrib/streaming/hadoop-<version>-streaming.jar) to
include the NLineInputFormat class and then pass a command line
configuration param to indicate that NLineInputFormat should be used? If
this is the proper approach, can you point me to an example of what kind of
param should be specified? I appreciate your help.

Thanks,
SD

On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:




You can use NLineInputFormat for this, which splits one line (N=1, by
default) as one split.
So, each map task processes one line.
See

http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html

-Amareshwari

S D wrote:





Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list
and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop streaming
without a reduce function to process each file and save the results (back
to S3 native in my case). To handle the long processing time of each file
I've set mapred.task.timeout=0 and I have a pretty straightforward Ruby
script reading from STDIN:

STDIN.each_line do |line|
 # Get file from contents of line
 # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have
up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being
allocated
one task each and the other worker sitting idly. If I split my input
file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation of this is fuzzy. It seems that Hadoop streaming will take
separate input files and allocate a new task per file (up to the maximum
constraint) but if given a single input file it is unclear as to whether a
new task is allocated per file or line. My understanding of Hadoop Java is
that (unlike Hadoop streaming) when given a single input file, the file will
be broken up into separate lines and the maximum number of map tasks will
automagically be allocated to handle the lines of the file (assuming the use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD


Re: Hadoop Streaming Semantics

2009-02-02 Thread S D
Thanks for your response. I'm using version 0.19.0 of Hadoop.
I tried your suggestion. Here is the command I use to invoke Hadoop:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
   -input /user/hadoop/hadoop-input/inputFile.txt \
   -output /user/hadoop/hadoop-output \
   -mapper map-script.sh \
   -file map-script.sh \
   -file additional-script.rb \
   -file utils.rb \
   -file env.sh \
   -file aws-s3-credentials-file \
   -jobconf mapred.reduce.tasks=0 \
   -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

(additional-script.rb is called by map-script.sh; aws-s3-credentials-file gives the
job permission to use AWS::S3.)

Everything works fine if the -inputformat switch is not included, but when I
include it I get the following message:
   ERROR streaming.StreamJob: Job not Successful!
and a runtime exception shows up in the JobTracker log:
   PipeMapRed.waitOutputThreads(): subprocess failed with code 1

My map functions read each line of the input file and create a directory
(one for each line) on Hadoop (in our case S3 Native) in which the
corresponding data is produced and stored. The names of the created
directories are based on the contents of the corresponding line. When I
include the -inputformat line above, I've noticed that instead of the
directories I'm expecting (named after the data found in the input file),
the directories are given seemingly arbitrary numeric names; e.g., when the
input file contained four lines of data, the directories were named 0, 273,
546 and 819.

Any thoughts?

John

On Sun, Feb 1, 2009 at 11:00 PM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:

> Which version of hadoop are you using?
>
> You can directly use -inputformat
> org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. You
> need not include it in your streaming jar.
> -Amareshwari
>
>
> S D wrote:
>
>> Thanks for your response Amareshwari. I'm unclear on how to take advantage
>> of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
>> streaming jar file (contrib/streaming/hadoop-<version>-streaming.jar) to
>> include the NLineInputFormat class and then pass a command line
>> configuration param to indicate that NLineInputFormat should be used? If
>> this is the proper approach, can you point me to an example of what kind of
>> param should be specified? I appreciate your help.
>>
>> Thanks,
>> SD
>>
>> On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu <
>> amar...@yahoo-inc.com> wrote:
>>
>>
>>
>>> You can use NLineInputFormat for this, which splits one line (N=1, by
>>> default) as one split.
>>> So, each map task processes one line.
>>> See
>>>
>>> http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>
>>> -Amareshwari
>>>
>>> S D wrote:
>>>
>>>
>>>
 Hello,

 I have a clarifying question about Hadoop streaming. I'm new to the list
 and
 didn't see anything posted that covers my questions - my apologies if I
 overlooked a relevant post.

 I have an input file consisting of a list of files (one per line) that need
 to be processed independently of each other. The duration for processing
 each file is significant - perhaps an hour each. I'm using Hadoop streaming
 without a reduce function to process each file and save the results (back
 to S3 native in my case). To handle the long processing time of each file
 I've set mapred.task.timeout=0 and I have a pretty straightforward Ruby
 script reading from STDIN:

 STDIN.each_line do |line|
  # Get file from contents of line
  # Process file (long running)
 end

 Currently I'm using a cluster of 3 workers in which each worker can have
 up
 to 2 tasks running simultaneously. I've noticed that if I have a single
 input file with many lines (more than 6 given my cluster), then not all
 workers will be allocated tasks; I've noticed two workers being
 allocated
 one task each and the other worker sitting idly. If I split my input
 file
 into multiple files (at least 6) then all workers will be immediately
 allocated the maximum number of tasks that they can handle.

 My interpretation of this is fuzzy. It seems that Hadoop streaming will take
 separate input files and allocate a new task per file (up to the maximum
 constraint) but if given a single input file it is unclear as to whether a
 new task is allocated per file or line. My understanding of Hadoop Java is
 that (unlike Hadoop streaming) when given a single input file, the file will
 be broken up into separate lines and the maximum number of map tasks will
 automagically be allocated to handle the lines of the file (assuming the use
 of TextInputFormat).

 Can someone clarify this?

 Thanks,
 SD



Re: Hadoop Streaming Semantics

2009-02-01 Thread Amareshwari Sriramadasu

Which version of hadoop are you using?

You can directly use -inputformat 
org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. 
You need not include it in your streaming jar.
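For example, an invocation along these lines should work (the paths and script
names here are only placeholders; mapred.line.input.format.linespermap is, as far
as I remember, the property NLineInputFormat reads for the number of lines per
split, so please double-check it against the API docs):

hadoop jar /path/to/hadoop-streaming.jar \
   -input /path/to/input \
   -output /path/to/output \
   -mapper your-map-script.sh \
   -file your-map-script.sh \
   -jobconf mapred.reduce.tasks=0 \
   -jobconf mapred.line.input.format.linespermap=1 \
   -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat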

-Amareshwari

S D wrote:

Thanks for your response Amareshwari. I'm unclear on how to take advantage
of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
streaming jar file (contrib/streaming/hadoop-<version>-streaming.jar) to
include the NLineInputFormat class and then pass a command line
configuration param to indicate that NLineInputFormat should be used? If
this is the proper approach, can you point me to an example of what kind of
param should be specified? I appreciate your help.

Thanks,
SD

On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:


You can use NLineInputFormat for this, which splits one line (N=1, by
default) as one split.
So, each map task processes one line.
See
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html

-Amareshwari

S D wrote:



Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list
and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop streaming
without a reduce function to process each file and save the results (back
to S3 native in my case). To handle the long processing time of each file I've
set mapred.task.timeout=0 and I have a pretty straightforward Ruby script
reading from STDIN:

STDIN.each_line do |line|
  # Get file from contents of line
  # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have
up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being allocated
one task each and the other worker sitting idly. If I split my input file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation of this is fuzzy. It seems that Hadoop streaming will take
separate input files and allocate a new task per file (up to the maximum
constraint) but if given a single input file it is unclear as to whether a
new task is allocated per file or line. My understanding of Hadoop Java is
that (unlike Hadoop streaming) when given a single input file, the file will
be broken up into separate lines and the maximum number of map tasks will
automagically be allocated to handle the lines of the file (assuming the use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD


Re: Hadoop Streaming Semantics

2009-01-30 Thread S D
Thanks for your response Amareshwari. I'm unclear on how to take advantage
of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
streaming jar file (contrib/streaming/hadoop-<version>-streaming.jar) to
include the NLineInputFormat class and then pass a command line
configuration param to indicate that NLineInputFormat should be used? If
this is the proper approach, can you point me to an example of what kind of
param should be specified? I appreciate your help.

Thanks,
SD

On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:

> You can use NLineInputFormat for this, which splits one line (N=1, by
> default) as one split.
> So, each map task processes one line.
> See
> http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> -Amareshwari
>
> S D wrote:
>
>> Hello,
>>
>> I have a clarifying question about Hadoop streaming. I'm new to the list
>> and
>> didn't see anything posted that covers my questions - my apologies if I
>> overlooked a relevant post.
>>
>> I have an input file consisting of a list of files (one per line) that need
>> to be processed independently of each other. The duration for processing
>> each file is significant - perhaps an hour each. I'm using Hadoop streaming
>> without a reduce function to process each file and save the results (back
>> to S3 native in my case). To handle the long processing time of each file I've
>> set mapred.task.timeout=0 and I have a pretty straightforward Ruby script
>> reading from STDIN:
>>
>> STDIN.each_line do |line|
>>   # Get file from contents of line
>>   # Process file (long running)
>> end
>>
>> Currently I'm using a cluster of 3 workers in which each worker can have
>> up
>> to 2 tasks running simultaneously. I've noticed that if I have a single
>> input file with many lines (more than 6 given my cluster), then not all
>> workers will be allocated tasks; I've noticed two workers being allocated
>> one task each and the other worker sitting idly. If I split my input file
>> into multiple files (at least 6) then all workers will be immediately
>> allocated the maximum number of tasks that they can handle.
>>
>> My interpretation of this is fuzzy. It seems that Hadoop streaming will take
>> separate input files and allocate a new task per file (up to the maximum
>> constraint) but if given a single input file it is unclear as to whether a
>> new task is allocated per file or line. My understanding of Hadoop Java is
>> that (unlike Hadoop streaming) when given a single input file, the file will
>> be broken up into separate lines and the maximum number of map tasks will
>> automagically be allocated to handle the lines of the file (assuming the use
>> of TextInputFormat).
>>
>> Can someone clarify this?
>>
>> Thanks,
>> SD


Re: Hadoop Streaming Semantics

2009-01-29 Thread Amareshwari Sriramadasu
You can use NLineInputFormat for this, which splits one line (N=1, by 
default) as one split.

So, each map task processes one line.
See 
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html


-Amareshwari
S D wrote:

Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop streaming
without a reduce function to process each file and save the results (back to
S3 native in my case). To handle the long processing time of each file I've
set mapred.task.timeout=0 and I have a pretty straightforward Ruby script
reading from STDIN:

STDIN.each_line do |line|
   # Get file from contents of line
   # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being allocated
one task each and the other worker sitting idly. If I split my input file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation of this is fuzzy. It seems that Hadoop streaming will take
separate input files and allocate a new task per file (up to the maximum
constraint) but if given a single input file it is unclear as to whether a
new task is allocated per file or line. My understanding of Hadoop Java is
that (unlike Hadoop streaming) when given a single input file, the file will
be broken up into separate lines and the maximum number of map tasks will
automagically be allocated to handle the lines of the file (assuming the use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD





Hadoop Streaming Semantics

2009-01-29 Thread S D
Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop streaming
without a reduce function to process each file and save the results (back to
S3 native in my case). To handle the long processing time of each file I've
set mapred.task.timeout=0 and I have a pretty straightforward Ruby script
reading from STDIN:

STDIN.each_line do |line|
   # Get file from contents of line
   # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being allocated
one task each and the other worker sitting idly. If I split my input file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation of this is fuzzy. It seems that Hadoop streaming will take
separate input files and allocate a new task per file (up to the maximum
constraint) but if given a single input file it is unclear as to whether a
new task is allocated per file or line. My understanding of Hadoop Java is
that (unlike Hadoop streaming) when given a single input file, the file will
be broken up into separate lines and the maximum number of map tasks will
automagically be allocated to handle the lines of the file (assuming the use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD