[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362508 ]
Rod Taylor commented on NUTCH-171:
----------------------------------

The overhead of generate/update versus fetch is the big one. A smaller segment fits easily into memory, reducing the number of disk accesses required. Ten 3M batches can be fetched, parsed, and output by SegmentReader in far less time than a single 30M batch. When you overlap their execution, so that as soon as a tasktracker has space it begins working on the next segment, there are no delays. The benefit comes in the reduce time required.

Ideally we could overlap the segment2 map with the segment1 reduce to keep bandwidth usage constant. We are working on a patch to allow specifying a target bandwidth on a per-task basis, with a varying thread count that tries to keep bandwidth usage maxed out at a defined limit (about 50Mb/sec for us). Being able to specify the number of map tasktrackers separately from the number of reduce tasktrackers would make this possible.

With a linear process like crawl there is a 20% to 50% gap in fetching while generate/update run. We want fetches to overlap with these -- again, ideally filling our bandwidth. At the beginning the overhead was lower, but it grows as the number of URLs in our database grows. I don't really care how long generate takes if there is a steady stream of new data being downloaded at the same time.

Finally, since we use Nutch as a crawler only (no indexer), there is no performance penalty for having a large number of small segments. It makes a number of things possible, like scheduling maintenance, recovering from errors, and debugging, and of course it gives the previously mentioned IO reduction.
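As a rough illustration of the bandwidth-targeting idea above, the thread count could be rescaled each interval in proportion to the gap between measured and target throughput. This is only a sketch of one plausible control loop; the class and method names are hypothetical and are not taken from the actual NUTCH-171 patch.

```java
// Hypothetical sketch: vary the fetcher thread count to hold throughput
// near a per-task target (e.g. 50 Mb/sec), assuming throughput scales
// roughly linearly with the number of fetcher threads.
public class BandwidthThrottle {
    private final double targetMbps;
    private final int minThreads;
    private final int maxThreads;

    public BandwidthThrottle(double targetMbps, int minThreads, int maxThreads) {
        this.targetMbps = targetMbps;
        this.minThreads = minThreads;
        this.maxThreads = maxThreads;
    }

    /** Scale the current thread count toward the target bandwidth,
     *  clamped to the configured [minThreads, maxThreads] range. */
    public int nextThreadCount(int currentThreads, double measuredMbps) {
        if (measuredMbps <= 0) {
            return maxThreads; // nothing measured yet: open up fully
        }
        double scaled = currentThreads * (targetMbps / measuredMbps);
        int next = (int) Math.round(scaled);
        return Math.max(minThreads, Math.min(maxThreads, next));
    }
}
```

A fetcher loop would call `nextThreadCount` periodically with the bandwidth observed over the last interval and resize its thread pool accordingly.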
> Bring back multiple segment support for Generate / Update
> ---------------------------------------------------------
>
>          Key: NUTCH-171
>          URL: http://issues.apache.org/jira/browse/NUTCH-171
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Rod Taylor
>     Priority: Minor
>  Attachments: multi_segment.patch
>
> We find it convenient to be able to run generate once for -topN 300M and have
> multiple independent segments to work with (lower overhead) -- then run
> update on all segments which succeeded simultaneously.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided
> segments again.
> Radu Mateescu wrote the attached patch for us with the below description
> (lightly edited):
>
> The implementation of -numFetchers in 0.8 improperly plays with the number of
> reduce tasks in order to generate a given number of fetch lists. Basically,
> what it does is this: before the second reduce (map-reduce is applied twice
> for generate), it sets the number of reduce tasks to numFetchers and ideally,
> because each reduce will create a file like part-00000, part-00001, etc. in
> the ndfs, we'll end up with the desired number of fetch lists. But this
> behaviour is incorrect for the following reasons:
> 1. The number of reduce tasks is orthogonal to the number of segments
> somebody wants to create. The number of reduce tasks should be chosen based
> on the physical topology rather than the number of segments someone might
> want in ndfs.
> 2. If you specify a value for the mapred.reduce.tasks property in
> nutch-site.xml, numFetchers seems to be ignored.
>
> Therefore, I changed this behaviour to work like this:
> - generate will create numFetchers segments
> - each reduce task will write to all segments (assuming there are enough
>   values to be written) in a round-robin fashion
> The end result for 3 reduce tasks and 2 segments looks like this:
>
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122228 Client connection to 192.168.0.1:5466: starting
> 060111 122228 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/20060111122144-0 <dir>
> /user/root/segments/20060111122144-1 <dir>
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/20060111122144-0/crawl_generate/part-00000  1276
> /user/root/segments/20060111122144-0/crawl_generate/part-00001  1289
> /user/root/segments/20060111122144-0/crawl_generate/part-00002  1858
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/20060111122144-1/crawl_generate/part-00000  1207
> /user/root/segments/20060111122144-1/crawl_generate/part-00001  1236
> /user/root/segments/20060111122144-1/crawl_generate/part-00002  1841

--
This message is automatically generated by JIRA.
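The round-robin behaviour quoted above (each reduce task cycling its output across all numFetchers segment directories, so every segment receives a part file from every reducer) can be sketched as follows. This is an illustrative assumption of the scheme, not the actual patch code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of round-robin segment assignment: record i written by a
// reduce task goes to segment (i mod numFetchers), so records are
// spread evenly across all generated segments.
public class RoundRobinSegments {
    /** For each record index, return the segment directory it lands in,
     *  e.g. "segments/20060111122144-0" ... "-(numFetchers-1)". */
    public static List<String> assign(String segmentBase, int numFetchers, int numRecords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < numRecords; i++) {
            out.add(segmentBase + "-" + (i % numFetchers));
        }
        return out;
    }
}
```

This matches the listing above: with 3 reduce tasks and 2 segments, each segment ends up containing three part files (one per reducer), while each reducer's records alternate between the two segments.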
- If you think it was sent incorrectly contact one of the administrators:
  http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see:
  http://www.atlassian.com/software/jira
