[ http://issues.apache.org/jira/browse/NUTCH-171?page=all ]
Rod Taylor updated NUTCH-171:
-----------------------------
Attachment: multi_segment.patch
Perhaps -numFetchers should be renamed to -numSegments?
> Bring back multiple segment support for Generate / Update
> ---------------------------------------------------------
>
> Key: NUTCH-171
> URL: http://issues.apache.org/jira/browse/NUTCH-171
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Rod Taylor
> Priority: Minor
> Attachments: multi_segment.patch
>
> We find it convenient to run generate once with -topN 300M and get multiple
> independent segments to work with (lower overhead), then run a single update
> over all of the segments that succeeded.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided
> segments again.
> Radu Mateescu wrote the attached patch for us, with the description below
> (lightly edited):
> The implementation of -numFetchers in 0.8 misuses the number of reduce tasks
> to produce a given number of fetch lists. Before the second reduce (generate
> runs map-reduce twice), it sets the number of reduce tasks to numFetchers.
> Because each reduce task creates a file such as part-00000, part-00001, etc.
> in ndfs, we ideally end up with the desired number of fetch lists. But this
> behaviour is incorrect for the following reasons:
> 1. The number of reduce tasks is orthogonal to the number of segments
> somebody wants to create. The number of reduce tasks should be chosen based
> on the physical topology rather than the number of segments someone might
> want in ndfs.
> 2. If you specify a value for the mapred.reduce.tasks property in
> nutch-site.xml, -numFetchers seems to be ignored.
>
> Therefore, I changed this behaviour to work like this:
> - generate will create numFetchers segments
> - each reduce task will write to all segments (assuming there are enough
> values to be written) in a round-robin fashion
> The end result for 3 reduce tasks and 2 segments looks like this:
>
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122228 Client connection to 192.168.0.1:5466: starting
> 060111 122228 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/20060111122144-0 <dir>
> /user/root/segments/20060111122144-1 <dir>
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/20060111122144-0/crawl_generate/part-00000 1276
> /user/root/segments/20060111122144-0/crawl_generate/part-00001 1289
> /user/root/segments/20060111122144-0/crawl_generate/part-00002 1858
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/20060111122144-1/crawl_generate/part-00000 1207
> /user/root/segments/20060111122144-1/crawl_generate/part-00001 1236
> /user/root/segments/20060111122144-1/crawl_generate/part-00002 1841
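
The round-robin layout described in the quoted patch notes can be sketched
roughly as follows. This is only an illustration in Python, not the code in
multi_segment.patch (which is not reproduced here). The function name
distribute_round_robin and its arguments are hypothetical, and the sketch
assumes each reduce task simply cycles through the segments entry by entry.

```python
def distribute_round_robin(reduce_outputs, num_segments):
    """Spread each reduce task's entries across segments in turn.

    reduce_outputs[r] is the list of entries emitted by reduce task r
    (a stand-in for the fetch-list records; hypothetical structure).
    Returns {segment_index: {part_file_name: [entries, ...]}}.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    segments = {s: {} for s in range(num_segments)}
    for r, entries in enumerate(reduce_outputs):
        # Every reduce task keeps its own part file name in each segment.
        part = f"part-{r:05d}"
        for i, entry in enumerate(entries):
            # The i-th entry of a task goes to segment i mod num_segments.
            segments[i % num_segments].setdefault(part, []).append(entry)
    return segments


# 3 reduce tasks and 2 segments, mirroring the listing above.
layout = distribute_round_robin(
    [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]], 2
)
for seg, parts in layout.items():
    print(seg, sorted(parts))
```

With enough entries per reduce task, every segment ends up with one part file
per reduce task, which matches the 3 part files in each of the 2 segments
shown in the listing. A task with fewer entries than there are segments would
leave some segments without its part file, which is the "assuming there are
enough values" caveat.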