On Feb 3, 2011, at 11:03 PM, Keith Wiley wrote:
> Our data consists of a few million binary (image) files, only about 6MB each
> uncompressed, 2.5MB gzipped. A typical job does not need to touch every
> image, rather only a tiny subset of the images is required. In this sense
> our job is unlike a canonical Hadoop job which combs an entire database. As
> such our current implementation (not streaming, but "normal" Hadoop) first
> uses a SQL query against a non-Hadoop database to find the names of the
> relevant images for a given job, then sets up input splits for those
> individual files and assigns them as input to the Hadoop job.
Why can't you just process the entire dataset, skipping the records you
don't care about? In other words, your input to your streaming job would be
all the files. But you could send a list of the actual images you care about
as a dcache file. Then you'd get locality while also only processing the ones
you need.
[This is a very common pattern, BTW.]
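To make that concrete, the skip-mapper could look something like the Python sketch below. The cache-file name `wanted_images.txt` and the tab-separated record layout (image name as the first field) are assumptions for illustration, not anything from your setup:

```python
import os
import sys

def load_wanted(path):
    """One image name per line in the distributed-cache file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def map_stream(lines, wanted, out):
    """Pass through only records whose key (first tab field) is wanted;
    every other record is skipped, so unneeded images cost almost nothing."""
    for line in lines:
        key = line.rstrip("\n").split("\t", 1)[0]
        if key in wanted:
            out.write(line)

if __name__ == "__main__":
    # Hadoop symlinks the dcache file into the task's working directory
    # under whatever name you gave it (hypothetical name here).
    if os.path.exists("wanted_images.txt"):
        map_stream(sys.stdin, load_wanted("wanted_images.txt"), sys.stdout)
```

Since every map task reads the same small dcache file, the filter list never has to travel with the records themselves.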
> ....
> Which brings me to streaming. Serendipitously, our system seems to work
> considerably better if I don't use Java mappers and JNI, but rather use
> streaming to run a small native program that then coordinates the calls into
> the native library (further evidence that the problem lies somewhere in JNI's
> functionality). I am not sure how to arrange our input data for streaming
> however. It is binary of course, which has been a long-standing issue (I
> believe) with Hadoop streaming, since the assumption is that the data is
> 8-bit clean with newline record terminators.
I'm fairly certain this is actually fixed in either 0.21 or trunk. So
this long-standing issue will soon go away. :D
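For what it's worth, my understanding of the typedbytes format (HADOOP-1722, behind streaming's -io typedbytes option) is that each value is a one-byte type code followed by a payload; for raw bytes (code 0) the payload is a four-byte big-endian length plus the bytes themselves. A rough Python sketch of that framing, not checked against the actual implementation:

```python
import struct

def write_typedbytes_bytes(out, payload):
    """Frame raw bytes as a typedbytes value: type code 0,
    then a 4-byte big-endian length, then the payload."""
    out.write(struct.pack(">bi", 0, len(payload)))
    out.write(payload)

def read_typedbytes_bytes(inp):
    """Inverse of the writer: parse one typedbytes raw-bytes value."""
    code, length = struct.unpack(">bi", inp.read(5))
    if code != 0:
        raise ValueError("expected raw-bytes type code 0, got %d" % code)
    return inp.read(length)
```

The point being that binary image data rides inside a length-prefixed frame, so embedded newlines and tabs stop mattering.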
> Furthermore, I'm not really sure how I would indicate which files are the
> input for a given job, since I don't know how to write a driver program for
> streaming. I only know how to use the streaming jar directly from the command
> line, so there is no driver of my choosing; I can't control the driver at
> all. It's the "streaming" driver, and all I can do is give it my mapper and
> reducer native programs to run. You see my problem? With streaming I have
> no driver, just the mappers and reducers, but with a Java Hadoop program I
> have all three, and I can do useful preprocessing in the driver, like setting
> up the input paths to the subset of images comprising the input dataset for a
> given job. This is quite perplexing to me. How do I use streaming
> effectively if I can't write my own driver?
Again, I think a way around this dilemma is to basically do record
skipping. It is understood that streaming has a lot of limitations. The fact
that it works at all is pretty amazing. :)
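On the "no driver" point: nothing stops the driver from being an ordinary script that runs your SQL query first and then launches the streaming jar itself. A hypothetical Python sketch of that wrapper (the flag names, and -cacheFile in particular, are from memory of the 0.20-era jar, so double-check them against your streaming jar's -help output):

```python
def build_streaming_cmd(streaming_jar, inputs, output, mapper, reducer,
                        cache_files=()):
    """Assemble the argv for a hadoop-streaming run.
    The streaming jar accepts one -input flag per input path, so the
    SQL query's results can be passed straight through here."""
    cmd = ["hadoop", "jar", streaming_jar]
    for path in inputs:
        cmd += ["-input", path]
    cmd += ["-output", output, "-mapper", mapper, "-reducer", reducer]
    for cf in cache_files:
        cmd += ["-cacheFile", cf]
    return cmd
```

The SQL step runs first, its results become the inputs list, and a plain subprocess.call(cmd) (or an equivalent shell script) kicks off the job; that script is your driver.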