On Wed, May 6, 2009 at 1:46 PM, Foss User <foss...@gmail.com> wrote:
> Thanks for your response. I got a few more questions regarding
> optimizations.
>
> 1. Do Hadoop clients locally cache the data they last requested?
>
I don't know the DFS read path very well, but I don't believe there is any
built-in cache here. There is an undocumented configuration variable,
dfs.read.prefetch.size, which affects DFSClient's prefetching of data ahead
of the current file position, but I don't want to give any answer I'm not
certain of. Hopefully someone else will chime in here.

What I can say is that there is no *large* local cache of data. HDFS is
optimized for sequential reads, where a cache is generally useless if not
detrimental.

> 2. Is the meta data for file blocks on data node kept in the
> underlying OS's file system on namenode or is it kept in RAM of the
> name node?
>

The block locations are kept only in the RAM of the namenode, and are
updated whenever a datanode sends a "block report". This is why the namenode
stays in "safe mode" at startup until it has received block locations for
some configurable percentage of blocks from the datanodes.

> 3. If no more mapper functions can be run on the node that
> contains the data the mapper has to act on, is Hadoop
> intelligent enough to run the new mappers on some machines within the
> same rack?
>

Yes, assuming you have configured a network topology script. Otherwise,
Hadoop has no magical knowledge of your network infrastructure, and it
treats the whole cluster as a single rack called /default-rack.

> 4. When can a case like the above happen? I mean when can it happen
> that the configured maximum number of mappers for a tasktracker has
> been reached but Hadoop still needs to start more mappers?
>

If you have a file with 100 blocks all on the same three nodes, but you have
a six-node cluster, Hadoop will schedule some tasks on nodes that do not
contain the blocks, since it would rather keep the cluster utilized than
keep all data access local.

> 5. Are the multiple mappers and reducers run as separate threads
> within the same TaskTracker process?
>

No, they are run as child processes.

-Todd
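
P.S. On question 3, here is a minimal sketch of what a topology script can look like. Hadoop invokes the script with one or more IP addresses or hostnames as arguments and reads one rack path per argument from its standard output. The addresses and rack names in the table below are hypothetical placeholders, not anything from a real cluster, and the script would be wired in through the topology script property in your site configuration (topology.script.file.name in releases of this era, as far as I recall).

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack topology script.

The host-to-rack table is a hypothetical example; replace it with
the addresses and rack names of your own cluster.
"""
import sys

# Hypothetical mapping from datanode address to rack path.
RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
    "10.0.2.22": "/rack2",
}

# Unknown hosts fall back to the same name Hadoop uses when no
# topology script is configured.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the order the hosts were given."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the hosts as command-line arguments and expects
    # one rack path per argument on standard output.
    print("\n".join(resolve(sys.argv[1:])))
```

A hard-coded table keeps the sketch self-contained. On a real cluster you would more likely derive the rack from the subnet or read the mapping from a file, so that adding a node does not mean editing the script.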