Hey,

On Wed, Dec 29, 2010 at 8:36 PM, Hiller, Dean (Contractor)
<dean.hil...@broadridge.com> wrote:
>
> First Question: Is that normal behavior? (I know I can keep rebalancing
> but that can get tedious being manual and all).
>
This is normal if you were loading the data from Node1. It is an
optimization: when the writing client runs on a machine that also hosts
a DataNode, the first replica of each block is placed on that local
DataNode, avoiding an extra network hop during the load. (A sketch for
inspecting where a file's blocks actually landed is at the end of this
mail.)

> Second Question: We have huge files sent to us every night over ftp and
> most likely, we may mount hdfs from linux so as the file comes in, it
> would be written to hdfs. Is there a way to configure hdfs to be
> writing some of the file to one node and some of the file to another
> node?
>

No. The APIs for pushing data into HDFS give no control over where it
lands, but if you are able to pull from multiple sources you can achieve
this sort of thing (or push from a remote location, i.e. from a machine
that does not run a DataNode, so the NameNode spreads the replicas
instead of pinning the first copy locally). A sketch of the remote-push
approach is also at the end of this mail.

> Third: If I can do the 2nd question, I am hoping map/reduce can cope
> with splitting the file (I copied and modified LineRecordReader to suit
> our needs…key is not line number of file and is more just generated)….
> I guess ideally, I am hoping for more parallelization here so these big
> files are processed on multiple nodes and my map/reduce is written so
> the map jobs should be running close to where the data is being written
> to (and processed as well after the write), but of course far from the
> input files which are most likely not on the same nodes as where the
> data will be stored.
>

Whether you get data-local map tasks depends on your cluster setup. A
good replication factor helps make maps data-local and therefore
efficient. While MapReduce tries really hard to get the best data-local
score, performance may suffer if your replication factor is only 1 and
almost all the blocks of your input reside on a single node. (The last
sketch below shows how to raise replication on an existing input.)

HTH.

--
Harsh J
www.harshj.com
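
P.S. As promised, a rough sketch (untested here) that prints which
DataNodes hold each block of a file, using the stock
FileSystem#getFileBlockLocations API. The path is just a placeholder
for your own file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder path; point this at the file you loaded from Node1.
    Path file = new Path("/data/incoming/bigfile.dat");
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block; each lists the hosts holding a replica.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("Block " + i + ": "
          + java.util.Arrays.toString(blocks[i].getHosts()));
    }
  }
}

If most blocks report only Node1, you are seeing the local-write
optimization described above.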
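
The remote-push sketch: run this on the ftp landing host (assuming that
host does not run a DataNode), so the NameNode picks the replica
targets. The hdfs:// URI and both paths are placeholders for your
setup:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RemotePush {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the cluster's NameNode from outside the cluster.
    FileSystem fs =
        FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    InputStream in = new FileInputStream("/ftp/incoming/bigfile.dat");
    OutputStream out = fs.create(new Path("/data/incoming/bigfile.dat"));
    // This copyBytes overload closes both streams when it finishes.
    IOUtils.copyBytes(in, out, conf);
  }
}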
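
Finally, bumping replication on an already-written input gives the
JobTracker more candidate nodes for data-local maps. Another small
sketch; the path and the factor are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BumpReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Raise the replication factor of an existing file; 3 is the
    // usual default on production clusters.
    fs.setReplication(new Path("/data/incoming/bigfile.dat"), (short) 3);
  }
}

The same is doable from the shell:

hadoop fs -setrep 3 /data/incoming/bigfile.dat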