Hassen, I've been very succesful using Hadoop Streaming, Dumbo, and TypedBytes as a solution for using python to implement mappers and reducers.
TypedBytes is a hadoop encoding format that allows binary data (including lists and maps) to be encoded in a format that permits the serialized data to safely be passed to mappers/reducers via the command line through hadoop streaming. Dumbo is a python library which makes it easy to implement your mappers and reducers in python. In particular, it handles decoding the data encoded as typedbytes to native python types. J On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote: > Hassen, > > > I have lots of binary data that I parse using Python streaming. > > > The way I do this is stream the binary data into sequence files (the > binary data object I save in the key and (null) as the value). > > > Each key then gets written back to me line by line, key by key for an > entire block when streaming. > > > To have this work in streaming on the command line you need to > use -inputformat SequenceFileAsTextInputFormat > > > To create the sequence files I have a jar file that goes from > BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer > > > I am not sure if you can do this for your data but if not then make > your own InputFormat. > > > good luck! > > > /* > Joe Stein > http://www.linkedin.com/in/charmalloc > Twitter: @allthingshadoop > */ > > On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <hassen.ri...@cern.ch> > wrote: > Dear all, > > Is it possible to have a binary input to a map code written in > python? > > Thank you > Hassen > > >