I would like my streaming job to receive the names of files stored on HDFS, but 
not the actual contents of the files, and I would like data locality to be 
honored (I want mappers to run on nodes where the files are located).  Is there 
any way to do this, or does Hadoop only offer data locality if a file's entire 
contents are specified as input to the stdin stream?

My streaming job already works fine taking the names of files as input and
pulling the files directly from HDFS to the local node for processing by the
mapper (presumably discarding them from the CWD after the map task ends), but I
would like to get this to work in a data-local manner, and I really don't want
to stream the files over stdin if I can help it. They're binary, and the
underlying routines read from file paths anyway, so even if I could get binary
streaming to work (I realize there are methods for achieving this), I would
still have to dump the contents to disk just so the work routines could read
the data back in from a file. So I don't want the file contents over a stream,
just the name (and path).
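To make the question concrete, here is a minimal sketch (plain Python, not the
Hadoop API) of the scheduling behavior I'm after: each input record is just a
file *path*, but the task carrying it would advertise the hosts holding that
file's blocks (in real Hadoop this information comes from the namenode, e.g.
via block-location lookups), so the framework could place the map task on one
of those hosts. The file names, host names, and functions below are all
hypothetical, purely for illustration:

```python
# Hypothetical block-location table: path -> hosts storing its blocks.
# In Hadoop this would be obtained from the namenode, not hard-coded.
block_hosts = {
    "/data/a.bin": ["node1", "node2"],
    "/data/b.bin": ["node3"],
}

def preferred_hosts(path):
    """Hosts where a map task that reads only `path` should run."""
    return block_hosts.get(path, [])

def schedule(path, free_nodes):
    """Pick a data-local node if one is free, else fall back to any node."""
    for host in preferred_hosts(path):
        if host in free_nodes:
            return host
    return free_nodes[0] if free_nodes else None
```

So for "/data/a.bin" the task would land on node1 or node2 when either is
free, even though the mapper only ever receives the path string on stdin.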

Thanks.

________________________________________________________________________________
Keith Wiley     [email protected]     keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________
