On Feb 3, 2011, at 9:46 AM, Harsh J wrote:

> Hello,
> 
> On Thu, Feb 3, 2011 at 10:46 PM, Keith Wiley <[email protected]> wrote:
>> I've seen this asked before, but haven't seen a response yet.
>> 
>> If the input to a streaming job is not actual data splits but simple HDFS 
>> file names which are then read by the mappers, then how can data locality be 
>> achieved?
> 
> Also, if you're only looking to not split the files, you can pass in a

The files won't be split; they're only 6MB each.  I'm looking to get the files to 
my streaming job somehow, and the method I've chosen is to send mere fileNAMES 
via the streaming API and have the streaming program open each file from HDFS 
through a symbolic link in the distributed cache (the link originating from 
-cacheFile, presumably).
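For concreteness, a submission along those lines might look roughly like this (a sketch only: the jar path, HDFS paths, namenode address, and link name are all hypothetical; `-cacheFile` takes an `hdfs://...#linkname` URI, and the fragment after `#` names the symlink created in each task's working directory):

```shell
# Hypothetical streaming job: the input is a text file listing file names,
# and each data file is exposed to the tasks via a -cacheFile symlink.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  /user/kwiley/filename_list.txt \
    -output /user/kwiley/out \
    -mapper mapper.py \
    -file   mapper.py \
    -cacheFile 'hdfs://namenode:9000/user/kwiley/data/input_000.dat#input_000.dat'
```

With hundreds of files you'd repeat the `-cacheFile` flag once per file, which is exactly the part Harsh is warning about below.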

> custom FileInputFormat with isSplitable returning false? You'll lose
> completeness in locality because of blocks not present on the chosen
> node though, yes -- But I believe that adding a hundred files to
> DistributedCache is not the solution, as the DistributedCache data is
> distributed to ALL the nodes AFAIK.


My understanding is that the -cacheFile option and the 
DistributedCache.addCacheFile() method don't copy the entire file to the 
distributed cache, but rather make tiny symbolic links to the actual HDFS file. 
 Correct?  If you don't think I should add 100s of files to the distributed 
cache (or even 100s of links), then how else can I make the files available to 
my streaming program?

Put another way, do you know of another method by which the streaming 
programs could read files from HDFS?
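For what it's worth, the mapper end of the filename-based approach could be sketched as follows (assumptions: file names arrive one per line on stdin, the symlinks live in the task's working directory, and the tab-separated name/line-count output is just an illustrative placeholder for real per-file processing):

```python
#!/usr/bin/env python
# Sketch of a streaming mapper that receives HDFS file NAMES on stdin
# and reads each file through a like-named local symlink created by
# -cacheFile in the task's working directory.
import os
import sys

def map_filenames(lines, link_dir="."):
    """For each file name on input, open the symlink of the same name
    under link_dir and emit a tab-separated (name, line-count) record."""
    for raw in lines:
        name = raw.strip()
        if not name:
            continue
        path = os.path.join(link_dir, name)
        with open(path) as f:
            count = sum(1 for _ in f)
        yield "%s\t%d" % (name, count)

if __name__ == "__main__":
    # In the actual job, Hadoop Streaming feeds the input split
    # (the list of file names) to us on stdin.
    for record in map_filenames(sys.stdin):
        print(record)
```

Of course, this sidesteps locality entirely: the map task reads whatever file name it was handed, regardless of where that file's blocks live, which is the crux of the original question.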

Thanks.

________________________________________________________________________________
Keith Wiley     [email protected]     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
                                           --  Homer Simpson
________________________________________________________________________________
