Hello,

On Thu, Feb 3, 2011 at 10:46 PM, Keith Wiley <[email protected]> wrote:
> I've seen this asked before, but haven't seen a response yet.
>
> If the input to a streaming job is not actual data splits but simple HDFS
> file names which are then read by the mappers, then how can data locality be
> achieved.

Also, if you only want to keep the files from being split, you could pass in a
custom FileInputFormat whose isSplitable returns false.

You'll lose complete locality that way, yes, because some blocks of a file
won't be present on the chosen node and will have to be read over the network.
But I believe that adding a hundred files to the DistributedCache is not the
solution, as DistributedCache data is sent to ALL the nodes, AFAIK.

-- 
Harsh J
www.harshj.com
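
To make the locality trade-off for an unsplit file concrete, here is a rough, self-contained sketch in plain Java. It does not use the Hadoop API; the class LocalitySketch, its Block type, and the host names are all hypothetical, and the way Hadoop actually picks a node for a split may differ. It only shows that when a whole multi-block file goes to one mapper, even the best node holds just part of the file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Hadoop.
class LocalitySketch {
    /** Stand-in for one HDFS block: its size and the hosts holding a replica. */
    static final class Block {
        final long length;
        final List<String> hosts;

        Block(long length, String... hosts) {
            this.length = length;
            this.hosts = Arrays.asList(hosts);
        }
    }

    /** Sums, per host, the bytes of the file that have a replica on that host. */
    static Map<String, Long> bytesPerHost(List<Block> blocks) {
        // TreeMap keeps hosts sorted, so ties are resolved deterministically.
        Map<String, Long> totals = new TreeMap<>();
        for (Block b : blocks) {
            for (String host : b.hosts) {
                totals.merge(host, b.length, Long::sum);
            }
        }
        return totals;
    }

    /** Host holding the most bytes of the file, or null if there are no blocks. */
    static String bestHost(List<Block> blocks) {
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> e : bytesPerHost(blocks).entrySet()) {
            if (e.getValue() > bestBytes) {
                best = e.getKey();
                bestBytes = e.getValue();
            }
        }
        return best;
    }

    /** Fraction of the file's bytes that the given host can read locally. */
    static double localFraction(List<Block> blocks, String host) {
        long total = 0;
        for (Block b : blocks) {
            total += b.length;
        }
        if (total == 0) {
            return 0.0;
        }
        long local = bytesPerHost(blocks).getOrDefault(host, 0L);
        return (double) local / total;
    }

    public static void main(String[] args) {
        // Four equal-sized blocks of one file, two replicas each (made-up layout).
        List<Block> blocks = Arrays.asList(
            new Block(64, "A", "B"),
            new Block(64, "A", "C"),
            new Block(64, "A", "D"),
            new Block(64, "B", "C"));
        String host = bestHost(blocks);
        System.out.println(host + " holds " + localFraction(blocks, host) + " of the file");
    }
}
```

In this made-up layout the best host still misses one of the four blocks, so a mapper running there would have to fetch that block remotely.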
