I'm trying these solutions...Thanks for suggestions.
I'd like to +1 to using Dumbo for all things Python and Hadoop
MapReduce. Its one of the better ways to do things.
Do look at the initial conversation here:
http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html
as well.
The feature/bug fixes specified in the post are present in Apache
Hadoop 0.21 (which isn't deemed to be suited for production use yet)
and is also available in other (in-production-use) Hadoop
distributions such as Cloudera's, which is based off on 0.20.2:
https://ccp.cloudera.com/display/SUPPORT/Downloads
On Tue, Jun 21, 2011 at 10:43 AM, Jeremy Lewi <jer...@lewi.us> wrote:
Hassen,
I've been very succesful using Hadoop Streaming, Dumbo, and
TypedBytes
as a solution for using python to implement mappers and reducers.
TypedBytes is a hadoop encoding format that allows binary data
(including lists and maps) to be encoded in a format that permits the
serialized data to safely be passed to mappers/reducers via the
command
line through hadoop streaming.
Dumbo is a python library which makes it easy to implement your
mappers
and reducers in python. In particular, it handles decoding the data
encoded as typedbytes to native python types.
J
On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
Hassen,
I have lots of binary data that I parse using Python streaming.
The way I do this is stream the binary data into sequence files (the
binary data object I save in the key and (null) as the value).
Each key then gets written back to me line by line, key by key for
an
entire block when streaming.
To have this work in streaming on the command line you need to
use -inputformat SequenceFileAsTextInputFormat
To create the sequence files I have a jar file that goes from
BufferedReader and writes to
org.apache.hadoop.io.SequenceFile.Writer
I am not sure if you can do this for your data but if not then make
your own InputFormat.
good luck!
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/
On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <hassen.ri...@cern.ch>
wrote:
Dear all,
Is it possible to have a binary input to a map code
written in
python?
Thank you
Hassen
--
Harsh J