actually - this is possible - but changes to streaming are required. at one point we had gotten rid of the '\n' and '\t' separators between the keys and the values in the streaming code and streamed byte arrays directly to scripts (and then decoded them in the script). it worked perfectly fine. (in fact we were streaming thrift-generated byte streams - encoded in java land and decoded in python land :-))
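(editor's note: the message above doesn't say how the records were framed once the '\n'/'\t' separators were removed. here is a minimal sketch of one common approach - a 4-byte big-endian length prefix per record, decoded on the script side; the prefix scheme is an assumption for illustration, not necessarily what the original code did.)

```python
import struct
from io import BytesIO

def read_records(stream):
    """Yield raw byte records framed by a 4-byte big-endian length prefix."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # clean end of stream
        (length,) = struct.unpack(">I", header)
        yield stream.read(length)

# simulate a mapper's stdin carrying two binary records;
# note the first payload contains '\n' (0x0a) and '\t' (0x09) bytes,
# which is exactly why separator-free framing is needed
buf = BytesIO()
for payload in (b"\x00\x0a\x09\xff", b"hello"):
    buf.write(struct.pack(">I", len(payload)) + payload)
buf.seek(0)

records = list(read_records(buf))
```

with framing like this, the decoder never has to care whether the payload bytes happen to collide with hadoop streaming's usual separator characters.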
the binary data on hdfs is best stored as sequencefiles (if you store binary data in (what looks to hadoop like) a text file, then bad things will happen). if stored this way, hadoop doesn't care about newlines and tabs - those are purely artifacts of streaming. also, the streaming code (for unknown reasons) doesn't allow a SequenceFileInputFormat. there were minor tweaks we had to make to the streaming driver to allow this stuff ..

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Mon 4/7/2008 7:43 AM
To: core-user@hadoop.apache.org
Subject: Re: streaming + binary input/output data?

I don't think that binary input works with streaming because of the
assumption of one record per line.

If you want to script map-reduce programs, would you be open to a Groovy
implementation that avoids these problems?

On 4/7/08 6:42 AM, "John Menzer" <[EMAIL PROTECTED]> wrote:

> hi,
>
> i would like to use binary input and output data in combination with
> hadoop streaming.
>
> the reason why i want to use binary data is that parsing text to float
> seems to consume a lot of time compared to directly reading the binary
> floats.
>
> i am using a C-coded mapper (getting streaming data from stdin and
> writing to stdout) and no reducer.
>
> so my question is: how do i implement binary input/output in this
> context? as far as i understand, i need to put an '\n' char at the end
> of each binary 'line', so hadoop knows how to split/distribute the input
> data among the nodes and how to collect it for output(??)
>
> is this approach reasonable?
>
> thanks,
> john
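(editor's note: a short demonstration of why the '\n'-terminated approach john asks about - and the "bad things will happen" warning above - breaks down with binary floats: some perfectly ordinary float values contain a newline byte in their IEEE-754 encoding, so a line-oriented reader splits the record in the middle. python is used here purely for illustration; the same applies to a C mapper.)

```python
import struct

# 8.625 in IEEE-754 big-endian single precision is 0x410A0000 -
# its second byte is 0x0A, i.e. the '\n' character
raw = struct.pack(">f", 8.625)
assert b"\n" in raw

# a "one record per line" reader would split this single float in half
parts = raw.split(b"\n")
```

so terminating binary records with '\n' is not safe: the terminator byte can occur inside the data itself, which is why sequencefiles (which carry their own record lengths) are the right container for binary data on hdfs.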