[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372602 ]
Rod Taylor commented on NUTCH-171:
----------------------------------

> How is a 200M url segment unwieldy?

I have found this for two reasons.

First, Nutch still has a bad habit of not completing a segment once in a while. It retries the component that failed, but after a second or third failure it throws away the entire job.

The second is by design. Nutch sucks at IO and much prefers it when the data is in memory -- most programs are that way. 50 small segments barely need to touch disk (initial creation and final write); in fact, I could run them on diskless machines with gobs of memory. A single segment 50 times the size, however, uses a significant amount of IO for temporary work space.

I have not yet figured out how to get the map/reduce settings for a large segment to have the same IO patterns as small segments.

> Bring back multiple segment support for Generate / Update
> ---------------------------------------------------------
>
> Key: NUTCH-171
> URL: http://issues.apache.org/jira/browse/NUTCH-171
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Rod Taylor
> Priority: Minor
> Attachments: multi_segment.patch
>
> We find it convenient to be able to run generate once for -topN 300M and have
> multiple independent segments to work with (lower overhead) -- then run
> update on all segments which succeeded simultaneously.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided
> segments again.
> Radu Mateescu wrote the attached patch for us with the below description
> (lightly edited):
> The implementation of -numFetchers in 0.8 improperly plays with the number of
> reduce tasks in order to generate a given number of fetch lists.
> Basically, what it does is this: before the second reduce (map-reduce is
> applied twice for generate), it sets the number of reduce tasks to
> numFetchers and ideally, because each reduce will create a file like
> part-00000, part-00001, etc in the ndfs, we'll end up with the desired
> number of fetch lists. But this behaviour is incorrect for the following
> reasons:
> 1. the number of reduce tasks is orthogonal to the number of segments
> somebody wants to create. The number of reduce tasks should be chosen based
> on the physical topology rather than the number of segments someone might
> want in ndfs
> 2. if in nutch-site.xml you specify a value for the mapred.reduce.tasks
> property, numFetchers seems to be ignored
>
> Therefore, I changed this behaviour to work like this:
> - generate will create numFetchers segments
> - each reduce task will write in all segments (assuming there are enough
> values to be written) in a round-robin fashion
> The end results for 3 reduce tasks and 2 segments will look like this:
>
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122228 Client connection to 192.168.0.1:5466: starting
> 060111 122228 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/20060111122144-0 <dir>
> /user/root/segments/20060111122144-1 <dir>
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/20060111122144-0/crawl_generate/part-00000 1276
> /user/root/segments/20060111122144-0/crawl_generate/part-00001 1289
> /user/root/segments/20060111122144-0/crawl_generate/part-00002 1858
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/20060111122144-1/crawl_generate/part-00000 1207
> /user/root/segments/20060111122144-1/crawl_generate/part-00001 1236
> /user/root/segments/20060111122144-1/crawl_generate/part-00002 1841

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
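
The round-robin behaviour described in the patch notes (every reduce task writes into every segment, so each segment ends up with one part file per reduce task) can be illustrated with a small sketch. This is not the patch's Java code: the function name `assign_round_robin`, the in-memory data shapes, and the choice to restart the rotation at segment 0 for each reduce task are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_round_robin(reduce_outputs, num_fetchers):
    """Spread each reduce task's URLs across num_fetchers segments.

    reduce_outputs[r] is the list of URLs emitted by reduce task r
    (a hypothetical in-memory stand-in for the real reduce output).
    Returns {segment_index: {part_file_name: [urls]}}.
    """
    if num_fetchers < 1:
        raise ValueError("num_fetchers must be at least 1")
    segments = defaultdict(dict)
    for task_id, urls in enumerate(reduce_outputs):
        # One part file per reduce task, named like the listing above.
        part = f"part-{task_id:05d}"
        for i, url in enumerate(urls):
            # Assumption: rotation restarts at segment 0 for every task.
            segment = i % num_fetchers
            segments[segment].setdefault(part, []).append(url)
    return dict(sorted(segments.items()))


# 3 reduce tasks and 2 segments, as in the example from the description.
layout = assign_round_robin(
    [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]],
    num_fetchers=2,
)
for seg, parts in layout.items():
    print(f"segment {seg}: {sorted(parts)}")
```

With enough URLs per reduce task, each of the two segments holds part-00000 through part-00002, which matches the shape of the ndfs listings quoted above. The number of reduce tasks stays independent of the number of segments, which is the point of the change.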
