I'd like to +1 using Dumbo for all things Python and Hadoop MapReduce. It's one of the better ways to do things.
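For anyone who hasn't seen Dumbo before, here is a minimal sketch of the shape a Dumbo job takes. It rests on a few assumptions that are not from this thread: each record's value is assumed to reach the mapper as the raw bytes of one binary object (for example one image), and counting records by size is only a stand-in for real processing. The `dumbo.run(mapper, reducer)` call is Dumbo's usual entry point; everything else is illustrative.

```python
def mapper(key, value):
    # Assumption: `value` holds the raw bytes of one binary record,
    # already decoded from typedbytes into a native Python object.
    # Emit (size in bytes, 1) so the reducer can count records per size.
    yield len(value), 1


def reducer(key, values):
    # `values` iterates over every 1 emitted for this record size.
    yield key, sum(values)


if __name__ == "__main__":
    try:
        import dumbo  # only needed when the job is actually submitted
    except ImportError:
        print("dumbo is not installed; mapper/reducer can still be called directly")
    else:
        dumbo.run(mapper, reducer)
```

Because the mapper and reducer are plain generators, they can be tried locally, e.g. `list(mapper(None, b"abc"))`, before the script is submitted with Dumbo's `dumbo start` command.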
Do look at the initial conversation here as well:
http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html

The features/bug fixes specified in that post are present in Apache Hadoop 0.21 (which isn't deemed suitable for production use yet) and are also available in other (in-production-use) Hadoop distributions such as Cloudera's, which is based on 0.20.2:
https://ccp.cloudera.com/display/SUPPORT/Downloads

On Tue, Jun 21, 2011 at 10:43 AM, Jeremy Lewi <jer...@lewi.us> wrote:
> Hassen,
>
> I've been very successful using Hadoop Streaming, Dumbo, and TypedBytes
> as a solution for using python to implement mappers and reducers.
>
> TypedBytes is a hadoop encoding format that allows binary data
> (including lists and maps) to be encoded in a format that permits the
> serialized data to safely be passed to mappers/reducers via the command
> line through hadoop streaming.
>
> Dumbo is a python library which makes it easy to implement your mappers
> and reducers in python. In particular, it handles decoding the data
> encoded as typedbytes to native python types.
>
> J
>
> On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
>> Hassen,
>>
>> I have lots of binary data that I parse using Python streaming.
>>
>> The way I do this is stream the binary data into sequence files (the
>> binary data object I save in the key and (null) as the value).
>>
>> Each key then gets written back to me line by line, key by key for an
>> entire block when streaming.
>>
>> To have this work in streaming on the command line you need to
>> use -inputformat SequenceFileAsTextInputFormat
>>
>> To create the sequence files I have a jar file that goes from
>> BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer
>>
>> I am not sure if you can do this for your data but if not then make
>> your own InputFormat.
>>
>> good luck!
>>
>> /*
>> Joe Stein
>> http://www.linkedin.com/in/charmalloc
>> Twitter: @allthingshadoop
>> */
>>
>> On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <hassen.ri...@cern.ch>
>> wrote:
>> Dear all,
>>
>> Is it possible to have a binary input to a map code written in
>> python?
>>
>> Thank you
>> Hassen
>>
>

--
Harsh J
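
As a rough illustration of the SequenceFileAsTextInputFormat route quoted above, here is a sketch of a streaming mapper written for Python 3. It assumes, and the thread does not confirm this, that the sequence-file keys are BytesWritable, whose text form is space-separated hex pairs, and that the value is the literal text "(null)". The names `decode_key` and `map_stream` are hypothetical, and emitting the record size is only a placeholder for real parsing.

```python
import io
import sys


def decode_key(line):
    # Streaming hands the mapper one "key<TAB>value" line per record.
    key_text, _, _value = line.rstrip("\n").partition("\t")
    # Assumption: the key is a BytesWritable rendered as hex pairs,
    # e.g. "48 65 6c 6c 6f"; bytes.fromhex skips the spaces.
    return bytes.fromhex(key_text)


def map_stream(lines, out):
    # Placeholder processing: emit "<record size>\t1" for each record.
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = decode_key(line)
        out.write("%d\t1\n" % len(record))


# Local demo on one sample line; in a real streaming job the call
# would be map_stream(sys.stdin, sys.stdout) instead.
sample = io.StringIO("48 65 6c 6c 6f\t(null)\n")
map_stream(sample, sys.stdout)
```

If the keys were written with a different Writable type, the text handed to the mapper will look different and `decode_key` would need to change accordingly.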