Could NDFS be easily modified so that the master node sends each Map task to a node that already holds a local replica of that task's data, thereby alleviating network traffic load?
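The locality preference above could be sketched roughly as follows. This is a hypothetical illustration, not actual NDFS code; the names (`pickNode`, `replicaHosts`, `idleNodes`) are made up for the example:

```java
import java.util.*;

// Hypothetical sketch: the master prefers an idle node that already holds
// a replica of the block, falling back to any idle node only when no
// data-local node is free.
public class LocalityScheduler {
    public static String pickNode(List<String> replicaHosts, Set<String> idleNodes) {
        for (String host : replicaHosts) {
            if (idleNodes.contains(host)) {
                return host;          // data-local: no block transfer needed
            }
        }
        // Fall back to any idle node; this map task will read remotely.
        return idleNodes.isEmpty() ? null : idleNodes.iterator().next();
    }

    public static void main(String[] args) {
        Set<String> idle = new HashSet<>(Arrays.asList("nodeB", "nodeC"));
        System.out.println(pickNode(Arrays.asList("nodeA", "nodeC"), idle)); // nodeC
    }
}
```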
In a scenario like this, the master node could be prepped the way Google does it: when a job is nearing completion, it could spawn retries of the still-running map tasks on other nodes to try to complete the job in case certain nodes are failing for whatever reason (especially if you're processing 64 MB chunks).

It also seems that, because of the smaller chunk size, you could run more concurrent tasks even on a single node. With today's hardware we could impose an NDFS "file system" container even on a local, single-node system, so you could be aware of multiple local volumes and utilize them in your storage definition. On a local system with multiple disk drives, something like this would help utilize all of the I/O channels and CPUs (for example, a 32-thread Sun T2000 server with multiple attached disks could process quite a load as many smaller concurrent tasks rather than a few larger ones).

It appears Google chops work into 64 MB tasks (the same size as the GFS block size), and doing something similar in NDFS might make reads/writes a bit quicker and improve network I/O throughput, especially if the only network traffic is NDFS replica traffic and the updates on it, rather than actual serial I/O reading remote data.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
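The backup-retry idea described above (re-issuing still-running map tasks once the job is nearly done, with the first copy to finish winning) could be sketched like this. The threshold and names (`shouldSpeculate`, `NEAR_COMPLETION`) are assumptions for illustration, not anything from NDFS or GFS:

```java
// Hypothetical sketch of backup-task scheduling as described in the Google
// MapReduce design: once most tasks have completed, the master re-issues
// the remaining in-flight tasks to other idle nodes as a hedge against
// slow or failing machines.
public class SpeculativeExecution {
    static final double NEAR_COMPLETION = 0.9; // assumed threshold, not a real NDFS setting

    // Decide whether to start launching backup copies of the remaining tasks.
    public static boolean shouldSpeculate(int completed, int total) {
        return total > 0 && (double) completed / total >= NEAR_COMPLETION;
    }

    public static void main(String[] args) {
        System.out.println(shouldSpeculate(95, 100)); // true
        System.out.println(shouldSpeculate(50, 100)); // false
    }
}
```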
