[ 
http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362508 ] 

Rod Taylor commented on NUTCH-171:
----------------------------------

The overhead of generate/update versus fetch is the big one. A smaller segment 
fits easily into memory, reducing the number of disk accesses required. Ten 3M 
batches can be fetched, parsed, and output by SegmentReader in far less time 
than a single 30M batch.

When you overlap their execution, a tasktracker begins working on the next 
segment as soon as it has space -- no delays. The main saving is in the reduce 
time required. Ideally we could overlap segment2's map with segment1's reduce 
to keep bandwidth usage constant.
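This is not Nutch code, just a toy Python sketch of the overlap idea: while segment N is in its reduce phase, segment N+1's map (fetch/parse) phase starts, so the network is never idle between segments. All function names here are illustrative stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def map_phase(seg):
    # Stand-in for the fetch/parse (map) work on one segment.
    return f"mapped-{seg}"

def reduce_phase(data):
    # Stand-in for the reduce/merge work on one segment.
    return data.replace("mapped", "reduced")

def pipelined(segments):
    """Run segments so that segment N+1's map overlaps segment N's reduce."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending_reduce = None
        for seg in segments:
            mapped = pool.submit(map_phase, seg)  # next map starts immediately
            if pending_reduce is not None:
                results.append(pending_reduce.result())  # prior reduce overlaps it
            pending_reduce = pool.submit(reduce_phase, mapped.result())
        if pending_reduce is not None:
            results.append(pending_reduce.result())
    return results
```

With real fetch and reduce workloads, the map of one segment keeps bandwidth busy while the previous segment's reduce is CPU/disk bound.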

We are working on a patch that allows specifying a target bandwidth on a 
per-task basis, with a varying thread count that tries to keep bandwidth usage 
maxed out at a defined limit (about 50Mb/sec for us). Being able to specify the 
number of map tasktrackers separately from the number of reduce tasktrackers 
would make this possible.
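The patch described above is not public here; the following is only a minimal Python sketch of one plausible approach (a proportional controller, with made-up function and parameter names) to varying the thread count toward a bandwidth target:

```python
def adjust_threads(current_threads, measured_mbps, target_mbps,
                   min_threads=1, max_threads=500):
    """Scale the fetcher thread count so measured bandwidth approaches
    the target (e.g. 50 Mb/sec). Hypothetical helper, not Nutch API."""
    if measured_mbps <= 0:
        # No traffic measured yet: ramp up geometrically from idle.
        return min(max_threads, current_threads * 2)
    # Proportional scaling: more threads if under target, fewer if over.
    scaled = int(current_threads * target_mbps / measured_mbps)
    return max(min_threads, min(max_threads, scaled))
```

Called periodically between fetch rounds, this keeps usage near the defined limit without hard-coding a thread count per site mix.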


With a linear process like crawl, there is a 20% to 50% gap in fetching while 
generate/update run. We want fetches to overlap with these steps -- again, 
ideally filling our bandwidth. At the beginning the overhead was lower, but as 
the number of URLs in our database grows, the overhead grows with it. I don't 
really care how long generate takes if a steady stream of new data is being 
downloaded at the same time.


Finally, since we use Nutch as a crawler only (no indexer), there is no 
performance penalty for having a large number of small segments. It makes a 
number of things easier -- scheduling maintenance, error recovery, and 
debugging -- and of course makes the previously mentioned IO reduction possible.

> Bring back multiple segment support for Generate / Update
> ---------------------------------------------------------
>
>          Key: NUTCH-171
>          URL: http://issues.apache.org/jira/browse/NUTCH-171
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Rod Taylor
>     Priority: Minor
>  Attachments: multi_segment.patch
>
> We find it convenient to be able to run generate once for -topN 300M and have 
> multiple independent segments to work with (lower overhead) -- then run 
> update simultaneously on all segments that succeeded.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided 
> segments again.
> Radu Mateescu wrote the attached patch for us with the below description 
> (lightly edited):
> The implementation of -numFetchers in 0.8 improperly plays with the number of 
> reduce tasks in order to generate a given number of fetch lists. Basically, 
> what it does is this: before the second reduce (map-reduce is applied twice 
> for generate), it sets the number of reduce tasks to numFetchers and, ideally, 
> because each reduce will create a file like part-00000, part-00001, etc. in 
> NDFS, we'll end up with the desired number of fetch lists. But this 
> behaviour is incorrect for the following reasons:
> 1. The number of reduce tasks is orthogonal to the number of segments 
> somebody wants to create. The number of reduce tasks should be chosen based 
> on the physical topology rather than the number of segments someone might 
> want in NDFS.
> 2. If you specify a value for the mapred.reduce.tasks property in 
> nutch-site.xml, numFetchers seems to be ignored.
>  
> Therefore, I changed this behaviour to work like this: 
>  - generate will create numFetchers segments
>  - each reduce task will write to all segments (assuming there are enough 
> values to be written) in a round-robin fashion
> The end result for 3 reduce tasks and 2 segments will look like this:
>  
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122228 Client connection to 192.168.0.1:5466: starting
> 060111 122228 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/20060111122144-0    <dir>
> /user/root/segments/20060111122144-1    <dir>
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/20060111122144-0/crawl_generate/part-00000  1276
> /user/root/segments/20060111122144-0/crawl_generate/part-00001  1289
> /user/root/segments/20060111122144-0/crawl_generate/part-00002  1858
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/20060111122144-1/crawl_generate/part-00000  1207
> /user/root/segments/20060111122144-1/crawl_generate/part-00001  1236
> /user/root/segments/20060111122144-1/crawl_generate/part-00002  1841
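The round-robin assignment the patch description explains -- each reduce task spreading its values across all numFetchers segments -- can be sketched in a few lines of Python (an illustration only, not the patch itself):

```python
def round_robin_split(urls, num_segments):
    """Distribute one reduce task's values across all segments in
    round-robin fashion, so every segment gets a roughly equal share."""
    segments = [[] for _ in range(num_segments)]
    for i, url in enumerate(urls):
        segments[i % num_segments].append(url)
    return segments
```

This matches the listing above: every reduce task (part-00000 .. part-00002) contributes a file to both segments, and each segment's parts are similar in size because the values are dealt out one at a time.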

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira