Hassen,

I've been very succesful using Hadoop Streaming, Dumbo, and TypedBytes
as a solution for using python to implement mappers and reducers.

TypedBytes is a hadoop encoding format that allows binary data
(including lists and maps) to be encoded in a format that permits the
serialized data to safely be passed to mappers/reducers via the command
line through hadoop streaming.

Dumbo is a python library which makes it easy to implement your mappers
and reducers in python. In particular, it handles decoding the data
encoded as typedbytes to native python types.

J
On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
> Hassen, 
> 
> 
> I have lots of binary data that I parse using Python streaming.
> 
> 
> The way I do this is stream the binary data into sequence files (the
> binary data object I save in the key and (null) as the value).
> 
> 
> Each key then gets written back to me line by line, key by key for an
> entire block when streaming.
> 
> 
> To have this work in streaming on the command line you need to
> use -inputformat SequenceFileAsTextInputFormat
> 
> 
> To create the sequence files I have a jar file that goes from
> BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer
> 
> 
> I am not sure if you can do this for your data but if not then make
> your own InputFormat.
> 
> 
> good luck!
> 
> 
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> Twitter: @allthingshadoop
> */
> 
> On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <hassen.ri...@cern.ch>
> wrote:
>         Dear all,
>         
>         Is it possible to have a binary input to a map code written in
>         python?
>         
>         Thank you
>         Hassen
> 
> 
> 

Reply via email to