Bring back multiple segment support for Generate / Update
---------------------------------------------------------

         Key: NUTCH-171
         URL: http://issues.apache.org/jira/browse/NUTCH-171
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev    
    Reporter: Rod Taylor
    Priority: Minor
 Attachments: multi_segment.patch

We find it convenient to run generate once with -topN 300M and get multiple 
independent segments to work with (lower overhead), then run update once on 
all of the segments that succeeded.

This reactivates -numFetchers and fixes updatedb to handle multiple provided 
segments again.



Radu Mateescu wrote the attached patch for us, with the following description 
(lightly edited):

The implementation of -numFetchers in 0.8 improperly manipulates the number of 
reduce tasks in order to generate a given number of fetch lists. Basically, 
what it does is this: before the second reduce (map-reduce is applied twice for 
generate), it sets the number of reduce tasks to numFetchers. Ideally, because 
each reduce task creates a file such as part-00000, part-00001, etc. in ndfs, 
we end up with the desired number of fetch lists. But this behaviour is 
incorrect for the following reasons:
1. The number of reduce tasks is orthogonal to the number of segments somebody 
wants to create. The number of reduce tasks should be chosen based on the 
physical topology rather than the number of segments someone might want in ndfs.
2. If you specify a value for the mapred.reduce.tasks property in 
nutch-site.xml, numFetchers seems to be ignored.
 
Therefore, I changed this behaviour to work as follows:
 - generate will create numFetchers segments
 - each reduce task will write to all segments (assuming there are enough 
values to be written) in a round-robin fashion
The end result for 3 reduce tasks and 2 segments looks like this:
 
/opt/nutch/bin>./nutch ndfs -ls segments
060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
060111 122228 Client connection to 192.168.0.1:5466: starting
060111 122228 No FS indicated, using default:master:5466
Found 2 items
/user/root/segments/20060111122144-0    <dir>
/user/root/segments/20060111122144-1    <dir>

 
/opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
060111 122318 No FS indicated, using default:master:5466
060111 122318 Client connection to 192.168.0.1:5466: starting
Found 3 items
/user/root/segments/20060111122144-0/crawl_generate/part-00000  1276
/user/root/segments/20060111122144-0/crawl_generate/part-00001  1289
/user/root/segments/20060111122144-0/crawl_generate/part-00002  1858

 
/opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
060111 122334 Client connection to 192.168.0.1:5466: starting
060111 122334 No FS indicated, using default:master:5466
Found 3 items
/user/root/segments/20060111122144-1/crawl_generate/part-00000  1207
/user/root/segments/20060111122144-1/crawl_generate/part-00001  1236
/user/root/segments/20060111122144-1/crawl_generate/part-00002  1841
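
The round-robin behaviour described above can be modelled with a small 
sketch. This is not the code from the attached patch (which lives in Nutch's 
Java generator); it is an illustration in Python, and the function name 
distribute_round_robin, the sample entries, and the choice that every reduce 
task starts at segment 0 are all assumptions made for the example.

```python
def distribute_round_robin(reduce_outputs, num_fetchers):
    """Model the described scheme: every reduce task spreads its entries
    over all segments in turn.

    reduce_outputs: one list of entries (e.g. URLs) per reduce task.
    num_fetchers:   number of segments to create.
    Returns a list of segments; each segment maps a part-file name to
    the entries that reduce task wrote into that segment.
    """
    if num_fetchers < 1:
        raise ValueError("num_fetchers must be at least 1")
    segments = [{} for _ in range(num_fetchers)]
    for task_id, entries in enumerate(reduce_outputs):
        part_name = "part-%05d" % task_id
        for position, entry in enumerate(entries):
            # Assumption: each task starts at segment 0 and cycles onward.
            segment = segments[position % num_fetchers]
            segment.setdefault(part_name, []).append(entry)
    return segments


if __name__ == "__main__":
    # 3 reduce tasks, 2 segments, as in the listing above (made-up entries).
    outputs = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
    for index, segment in enumerate(distribute_round_robin(outputs, 2)):
        print(index, sorted(segment))
```

In this model the number of segments depends only on num_fetchers, while the 
number of part files inside each segment follows the number of reduce tasks, 
which matches the 2 segments with 3 part files each in the listing. A reduce 
task with fewer entries than there are segments would leave its part file 
absent from the later segments, which is the "enough values" caveat.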


