[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362508 ]
Rod Taylor commented on NUTCH-171:
----------------------------------

Overhead of generate/update versus fetch is the big one. A smaller segment fits easily into memory, reducing the number of disk accesses required. Ten 3M batches can be fetched, parsed, and output by SegmentReader in far less time than a single 30M batch. When you overlap their execution so that as soon as a tasktracker has space it begins working on the next segment, there are no delays. The benefit comes in the reduce time required.

Ideally we could overlap segment2's map with segment1's reduce to keep bandwidth usage constant. We are working on a patch to allow specifying a target bandwidth on a per-task basis, with a varying thread count that tries to keep bandwidth usage maxed out at a defined limit (about 50Mb/sec for us). Being able to specify the number of map tasktrackers separately from the number of reduce tasktrackers would make this possible.

With a linear process like crawl there is a 20% to 50% gap in fetching while generate/update run. We want fetches to overlap with these -- again, ideally filling our bandwidth. At the beginning the overhead was lower, but as the number of URLs in our database grows, the overhead grows with it. I don't really care how long generate takes if there is a steady stream of new data being downloaded at the same time.

Finally, since we use Nutch as a crawler only (no indexer), there is no performance penalty for having a large number of small segments. It makes a number of things easier, like scheduling maintenance, error recovery, and debugging errors, and of course gives the previously mentioned IO reduction.
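The "varying thread count to hit a target bandwidth" idea described above could be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class name BandwidthGovernor and its parameters are hypothetical, and a real fetcher would measure throughput over a sampling window before each adjustment.

```java
// Hypothetical sketch of the bandwidth-targeting idea: periodically
// compare measured throughput against the target and grow or shrink
// the fetcher thread count. Names here are illustrative, not Nutch API.
public class BandwidthGovernor {
    private final long targetBps;        // target, e.g. 50 Mb/sec in bits/sec
    private int threads;                 // current fetcher thread count
    private final int minThreads, maxThreads;

    public BandwidthGovernor(long targetBps, int initialThreads,
                             int minThreads, int maxThreads) {
        this.targetBps = targetBps;
        this.threads = initialThreads;
        this.minThreads = minThreads;
        this.maxThreads = maxThreads;
    }

    /** Adjust the thread count from the last measured throughput. */
    public int adjust(long measuredBps) {
        if (measuredBps < targetBps * 9 / 10 && threads < maxThreads) {
            threads++;   // well under target: add a fetcher thread
        } else if (measuredBps > targetBps && threads > minThreads) {
            threads--;   // over target: drop a fetcher thread
        }
        return threads;  // within 90%-100% of target: leave unchanged
    }
}
```

The 10% dead band keeps the controller from oscillating when throughput hovers near the limit; any similar hysteresis scheme would serve.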
> Bring back multiple segment support for Generate / Update
> ---------------------------------------------------------
>
>          Key: NUTCH-171
>          URL: http://issues.apache.org/jira/browse/NUTCH-171
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Rod Taylor
>     Priority: Minor
>  Attachments: multi_segment.patch
>
> We find it convenient to be able to run generate once for -topN 300M and have
> multiple independent segments to work with (lower overhead) -- then run
> update on all segments which succeeded simultaneously.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided
> segments again.
> Radu Mateescu wrote the attached patch for us with the below description
> (lightly edited):
> The implementation of -numFetchers in 0.8 improperly plays with the number of
> reduce tasks in order to generate a given number of fetch lists. Basically,
> what it does is this: before the second reduce (map-reduce is applied twice
> for generate), it sets the number of reduce tasks to numFetchers and ideally,
> because each reduce will create a file like part-00000, part-00001, etc. in
> the ndfs, we'll end up with the desired number of fetch lists. But this
> behaviour is incorrect for the following reasons:
> 1. The number of reduce tasks is orthogonal to the number of segments
> somebody wants to create. The number of reduce tasks should be chosen based
> on the physical topology rather than the number of segments someone might
> want in ndfs.
> 2. If you specify a value for the mapred.reduce.tasks property in
> nutch-site.xml, numFetchers seems to be ignored.
>
> Therefore, I changed this behaviour to work like this:
> - generate will create numFetchers segments
> - each reduce task will write to all segments (assuming there are enough
> values to be written) in a round-robin fashion
> The end result for 3 reduce tasks and 2 segments will look like this:
>
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122228 Client connection to 192.168.0.1:5466: starting
> 060111 122228 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/20060111122144-0 <dir>
> /user/root/segments/20060111122144-1 <dir>
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/20060111122144-0/crawl_generate/part-00000 1276
> /user/root/segments/20060111122144-0/crawl_generate/part-00001 1289
> /user/root/segments/20060111122144-0/crawl_generate/part-00002 1858
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/20060111122144-1/crawl_generate/part-00000 1207
> /user/root/segments/20060111122144-1/crawl_generate/part-00001 1236
> /user/root/segments/20060111122144-1/crawl_generate/part-00002 1841

--
This message is automatically generated by JIRA.
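The round-robin rule in the quoted description (each reduce task writes its i-th value into segment i mod numSegments, so every segment receives output from every reduce task, as the part-0000N files per segment above show) can be illustrated in isolation. This is a standalone sketch of the distribution logic only, not the attached patch; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the actual patch code) of the round-robin
// rule: the values seen by one reduce task are dealt out across the
// numFetchers segments like cards, value i going to segment i % n.
public class RoundRobinSegments {
    /** Distribute one reduce task's output values across segments. */
    public static List<List<String>> distribute(List<String> values,
                                                int numSegments) {
        List<List<String>> segments = new ArrayList<>();
        for (int s = 0; s < numSegments; s++) {
            segments.add(new ArrayList<>());
        }
        for (int i = 0; i < values.size(); i++) {
            segments.get(i % numSegments).add(values.get(i));
        }
        return segments;
    }
}
```

With 3 reduce tasks each doing this over 2 segments, every segment ends up containing one part file per reduce task, which matches the two three-file crawl_generate listings shown above.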
