Having seen a few related emails on this list, I think the following exchange between me and John may be of interest to a broader audience.
Runping

________________________________
From: Runping Qi
Sent: Sunday, April 13, 2008 8:58 AM
To: 'JJ'
Subject: RE: streaming + binary input/output data?

That is basically what I envisioned originally.

One issue is the data format of the streaming mapper output and of the streaming reducer output. Those data are parsed by the streaming framework into key/value pairs. The framework assumes that the key and the value are separated by a tab character, and that key/value pairs are separated by a newline ("\n"). That means the keys and values cannot contain those two characters. If the mapper and the reducer encode those characters, it will be fine. Encoding the values with base64 will do it. Keys are a bit trickier, since the framework will apply a compare function to them in order to do the sorting (and partitioning). However, in most cases it is acceptable to avoid binary data for keys.

Another issue is reading binary input data from and writing binary data to DFS. This can be addressed by implementing custom InputFormat and OutputFormat classes (only the user knows how to parse a specific binary data format).

For each input key/value pair, the streaming framework basically writes the following to the stdin of the streaming mapper:

    key.toString() + "\t" + value.toString() + "\n"

As long as you implement the toString() methods to ensure proper base64 encoding for the value (and for the key, if necessary), you will be fine.

So, in summary, all of these issues can be addressed in the user's code. Initially, I wondered whether the framework could be extended so that the user would only need to set some configuration variables to handle binary data. However, it is still unclear what that extension should look like for a broad class of applications. Maybe the best approach is for each user to do something like what I outlined above to address his/her specific problem.

Hope this helps.

Runping
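To make the toString() idea above concrete, here is a minimal sketch of what such a value class could look like. The class name Base64BytesWritable is made up for this illustration, and it uses java.util.Base64 purely for brevity (on the Hadoop/Java versions discussed in this thread you would more likely reach for org.apache.commons.codec.binary.Base64). It is a sketch under those assumptions, not a tested implementation:

    import java.util.Base64;
    import org.apache.hadoop.io.BytesWritable;

    // Hypothetical value type: stores raw bytes, but renders them as base64 text.
    public class Base64BytesWritable extends BytesWritable {

        public Base64BytesWritable() {
            super();
        }

        public Base64BytesWritable(byte[] bytes) {
            super(bytes);
        }

        // Streaming writes key.toString() + "\t" + value.toString() + "\n" to the
        // mapper's stdin, so encoding here keeps the payload free of tab and
        // newline characters.
        @Override
        public String toString() {
            byte[] raw = new byte[getLength()];
            System.arraycopy(getBytes(), 0, raw, 0, getLength());
            return Base64.getEncoder().encodeToString(raw);
        }
    }

A custom InputFormat/OutputFormat pair would then produce and consume such values when reading binary data from and writing it back to DFS.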
________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of JJ
Sent: Sunday, April 13, 2008 8:18 AM
To: Runping Qi
Subject: Re: streaming + binary input/output data?

thx for the info. what do you think about the idea of encoding the binary data with base64 to text before streaming it with hadoop?

John

2008/4/13, Runping Qi <[EMAIL PROTECTED]>:

No implementation/solution yet. If there are more real use cases/user interest, then somebody may have enough interest to provide a patch.

Runping

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 13, 2008 7:30 AM
> To: Runping Qi
> Subject: RE: streaming + binary input/output data?
>
> i just read the jira. these are interesting suggestions, but how do they
> translate into a solution for my problem/question? has all or at least
> some of this been implemented or not?
>
> thx
> John
>
> Runping Qi wrote:
> >
> > Actually, there is an old jira about the same issue:
> > https://issues.apache.org/jira/browse/HADOOP-1722
> >
> > Runping
> >
> >> -----Original Message-----
> >> From: John Menzer [mailto:[EMAIL PROTECTED]
> >> Sent: Saturday, April 12, 2008 2:45 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: RE: streaming + binary input/output data?
> >>
> >> so you mean you changed the hadoop streaming source code?
> >> actually i am not really willing to change the source code if it's not
> >> necessary.
> >>
> >> so i thought about simply encoding the input binary data to text (e.g.
> >> with base64) and then adding a '\n' after each line to make it splittable
> >> for streaming.
> >> after reading from stdin my C program would just have to decode it,
> >> map/reduce it, and then encode it back to base64 to write to stdout.
> >>
> >> what do you think about that? worth a try?
> >>
> >> Joydeep Sen Sarma wrote:
> >> >
> >> > actually - this is possible - but changes to streaming are required.
> >> >
> >> > at one point - we had gotten rid of the '\n' and '\t' separators between
> >> > the keys and the values in the streaming code and streamed byte arrays
> >> > directly to scripts (and then decoded them in the script). it worked
> >> > perfectly fine. (in fact we were streaming thrift-generated byte streams -
> >> > encoded in java land and decoded in python land :-))
> >> >
> >> > the binary data on hdfs is best stored as SequenceFiles (if you store binary
> >> > data in (what looks to hadoop like) a text file, then bad things will
> >> > happen). if stored this way, hadoop doesn't care about newlines and tabs
> >> > - those are purely artifacts of streaming.
> >> >
> >> > also - the streaming code (for unknown reasons) doesn't allow a
> >> > SequenceFileInputFormat. there were minor tweaks we had to make to the
> >> > streaming driver to allow this stuff ..
> >> >
> >> > -----Original Message-----
> >> > From: Ted Dunning [mailto:[EMAIL PROTECTED]
> >> > Sent: Mon 4/7/2008 7:43 AM
> >> > To: core-user@hadoop.apache.org
> >> > Subject: Re: streaming + binary input/output data?
> >> >
> >> > I don't think that binary input works with streaming because of the
> >> > assumption of one record per line.
> >> >
> >> > If you want to script map-reduce programs, would you be open to a Groovy
> >> > implementation that avoids these problems?
> >> >
> >> > On 4/7/08 6:42 AM, "John Menzer" <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> hi,
> >> >>
> >> >> i would like to use binary input and output data in combination with
> >> >> hadoop streaming.
> >> >>
> >> >> the reason why i want to use binary data is that parsing text to floats
> >> >> seems to consume a lot of time compared to directly reading the
> >> >> binary floats.
> >> >>
> >> >> i am using a C-coded mapper (getting streaming data from stdin and
> >> >> writing to stdout) and no reducer.
> >> >>
> >> >> so my question is: how do i implement binary input/output in this
> >> >> context? as far as i understand, i need to put a '\n' char at the end of
> >> >> each binary 'line' so hadoop knows how to split/distribute the input data
> >> >> among the nodes and how to collect it for output(??)
> >> >>
> >> >> is this approach reasonable?
> >> >>
> >> >> thanks,
> >> >> john
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
>

Quoted from:
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16658687.html
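For what it's worth, the decode -> process -> re-encode loop John describes for his streaming mapper would look roughly like the sketch below. It is written in Java purely for illustration (John's actual mapper is in C), and it assumes each value is a packed array of big-endian floats; that layout, the class name, and the doubling "map" step are assumptions of the example, not something stated in the thread:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.ByteBuffer;
    import java.util.Base64;

    // Hypothetical streaming mapper: reads "key \t base64(value)" lines from stdin,
    // decodes the value into floats, transforms them, and re-encodes to stdout.
    public class Base64FloatMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                String key = tab >= 0 ? line.substring(0, tab) : line;
                String encoded = tab >= 0 ? line.substring(tab + 1) : "";

                // Base64 back to raw bytes, then view the bytes as big-endian floats.
                byte[] raw = Base64.getDecoder().decode(encoded);
                ByteBuffer buf = ByteBuffer.wrap(raw);
                float[] values = new float[raw.length / 4];
                for (int i = 0; i < values.length; i++) {
                    values[i] = buf.getFloat();
                }

                // Stand-in "map" step: scale each float; a real mapper would do its
                // own per-record work here.
                ByteBuffer out = ByteBuffer.allocate(values.length * 4);
                for (float v : values) {
                    out.putFloat(v * 2.0f);
                }

                // Re-encode so the emitted value again contains no tabs or newlines.
                System.out.println(key + "\t"
                        + Base64.getEncoder().encodeToString(out.array()));
            }
        }
    }

The same framing carries over to C: split each stdin line at the first tab, base64-decode the value, work on the floats, re-encode, and write key, tab, encoded value, newline to stdout.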