actually - this is possible - but changes to streaming are required. at one point we had gotten rid of the '\n' and '\t' separators between the keys and the values in the streaming code and streamed byte arrays directly to scripts (and then decoded them in the script). it worked perfectly fine. (in fact we were streaming thrift-generated byte streams - encoded in java land and decoded in python land :-))
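(editor's note: the message above doesn't say how the records were framed once the '\n'/'\t' separators were removed. here is a minimal sketch of one common approach - a 4-byte big-endian length prefix per record, decoded on the script side; the prefix scheme is an assumption for illustration, not necessarily what the original code did.)

```python
import struct
from io import BytesIO

def read_records(stream):
    """Yield raw byte records framed by a 4-byte big-endian length prefix."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # clean end of stream
        (length,) = struct.unpack(">I", header)
        yield stream.read(length)

# simulate a mapper's stdin carrying two binary records;
# note the first payload contains '\n' (0x0a) and '\t' (0x09) bytes,
# which is exactly why separator-free framing is needed
buf = BytesIO()
for payload in (b"\x00\x0a\x09\xff", b"hello"):
    buf.write(struct.pack(">I", len(payload)) + payload)
buf.seek(0)

records = list(read_records(buf))
```

with framing like this, the decoder never has to care whether the payload bytes happen to collide with hadoop streaming's usual separator characters.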
the binary data on hdfs is best stored as sequencefiles (if you store binary data in (what looks to hadoop like) a text file, then bad things will happen). if stored this way, hadoop doesn't care about newlines and tabs - those are purely artifacts of streaming. also, the streaming code (for unknown reasons) doesn't allow a SequenceFileInputFormat. there were minor tweaks we had to make to the streaming driver to allow this stuff ..

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Mon 4/7/2008 7:43 AM
To: core-user@hadoop.apache.org
Subject: Re: streaming + binary input/output data?

I don't think that binary input works with streaming because of the
assumption of one record per line.

If you want to script map-reduce programs, would you be open to a Groovy
implementation that avoids these problems?

On 4/7/08 6:42 AM, "John Menzer" <[EMAIL PROTECTED]> wrote:

> hi,
>
> i would like to use binary input and output data in combination with
> hadoop streaming.
>
> the reason why i want to use binary data is that parsing text to float
> seems to consume a lot of time compared to directly reading the binary
> floats.
>
> i am using a C-coded mapper (getting streaming data from stdin and
> writing to stdout) and no reducer.
>
> so my question is: how do i implement binary input/output in this
> context? as far as i understand, i need to put an '\n' char at the end
> of each binary 'line', so hadoop knows how to split/distribute the input
> data among the nodes and how to collect it for output(??)
>
> is this approach reasonable?
>
> thanks,
> john
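(editor's note: a short demonstration of why the '\n'-terminated approach john asks about - and the "bad things will happen" warning above - breaks down with binary floats: some perfectly ordinary float values contain a newline byte in their IEEE-754 encoding, so a line-oriented reader splits the record in the middle. python is used here purely for illustration; the same applies to a C mapper.)

```python
import struct

# 8.625 in IEEE-754 big-endian single precision is 0x410A0000 -
# its second byte is 0x0A, i.e. the '\n' character
raw = struct.pack(">f", 8.625)
assert b"\n" in raw

# a "one record per line" reader would split this single float in half
parts = raw.split(b"\n")
```

so terminating binary records with '\n' is not safe: the terminator byte can occur inside the data itself, which is why sequencefiles (which carry their own record lengths) are the right container for binary data on hdfs.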