[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846174#action_12846174
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-----------------------------------------

For users who generate just 1 segment at a time, this is an unexpected loss of 
flexibility. You can't run this version of Generator twice without first 
completing _both_ fetching & updating of all segments from the previous run, 
because some of the same URLs would be generated again in the next round. The 
point of generate.update.crawldb is to be able to freely interleave 
generate/update steps.

E.g. the following scenario breaks in a non-obvious way:

* generate 10 segments
* fetch & update 8 of them
* realize you need more rounds due to e.g. gone pages
* generate additional 10 segments

..kaboom! Now the new segments partially overlap with the 2 unfetched segments 
from the previous generation, and you are going to fetch some URLs twice. 
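
The scenario above can be reproduced with a toy model. This is only a sketch, not Nutch code: a dict stands in for the crawldb, lists stand in for segments, and the `mark_generated` flag is a hypothetical stand-in for the marking behaviour that generate.update.crawldb is meant to provide. All function names and the example.com URLs are invented for illustration.

```python
# Toy model only: a dict plays the crawldb, lists of URLs play the segments.

def generate(crawldb, num_segments, per_segment, mark_generated):
    """Select unfetched URLs into segments; optionally mark them in the crawldb."""
    eligible = sorted(url for url, status in crawldb.items() if status == "unfetched")
    segments = []
    for i in range(num_segments):
        chunk = eligible[i * per_segment:(i + 1) * per_segment]
        if not chunk:
            break  # fewer URLs than requested segments
        segments.append(chunk)
        if mark_generated:
            # Assumption: marking makes the URL ineligible for the next generate.
            for url in chunk:
                crawldb[url] = "generated"
    return segments


def fetch_and_update(crawldb, segment):
    """Pretend to fetch a segment and fold the result back into the crawldb."""
    for url in segment:
        crawldb[url] = "fetched"


def overlap_after_second_round(mark_generated):
    """Generate 10 segments, fetch & update 8, generate 10 more, report overlap."""
    crawldb = {f"http://example.com/{i:03d}": "unfetched" for i in range(40)}
    first = generate(crawldb, 10, 2, mark_generated)
    for segment in first[:8]:
        fetch_and_update(crawldb, segment)
    pending = {url for segment in first[8:] for url in segment}
    second = generate(crawldb, 10, 2, mark_generated)
    regenerated = {url for segment in second for url in segment}
    return sorted(pending & regenerated)


if __name__ == "__main__":
    print("overlap without marking:", overlap_after_second_round(False))
    print("overlap with marking:   ", overlap_after_second_round(True))
```

In this model, the URLs of the two unfetched segments are selected again only when the first generate leaves no mark in the crawldb, which is the overlap described above.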

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-762
>                 URL: https://issues.apache.org/jira/browse/NUTCH-762
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>    Affects Versions: 1.0.0
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than 
> the max value selected, e.g. if not enough URLs are available for fetching 
> and they fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  
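
To make the quoted description more concrete, here is a rough sketch of producing several fetchlists from one pass over the candidates, with a score filter and a per-host cap. It is not the code from the attached patch: the function name, the `min_score` and `max_per_host` parameters, and the choice to apply the cap per segment (spilling extra URLs into later segments) are all assumptions made for illustration. `top_n` and `max_num_segments` loosely mirror -topN and -maxNumSegments, and both caps are assumed to be at least 1.

```python
from collections import defaultdict
from urllib.parse import urlparse


def multi_generate(entries, top_n, max_num_segments, min_score=0.0, max_per_host=None):
    """Build up to max_num_segments fetchlists from one pass over (url, score) pairs."""
    # Filter by score first so low-scoring entries never reach the sort.
    candidates = sorted(
        ((url, score) for url, score in entries if score >= min_score),
        key=lambda entry: (-entry[1], entry[0]),
    )
    segments = []     # one list of URLs per segment
    host_counts = []  # parallel list: host -> URL count within that segment
    for url, _score in candidates:
        host = urlparse(url).hostname
        for segment, counts in zip(segments, host_counts):
            has_room = len(segment) < top_n
            under_cap = max_per_host is None or counts[host] < max_per_host
            if has_room and under_cap:
                segment.append(url)
                counts[host] += 1
                break
        else:
            # No existing segment can take the URL: open a new one if allowed,
            # otherwise the URL is left for a later generate.
            if len(segments) < max_num_segments:
                segments.append([url])
                host_counts.append(defaultdict(int, {host: 1}))
    return segments


if __name__ == "__main__":
    sample = [
        ("http://a.example/1", 0.9),
        ("http://a.example/2", 0.8),
        ("http://a.example/3", 0.7),
        ("http://b.example/1", 0.6),
        ("http://b.example/2", 0.1),
    ]
    for segment in multi_generate(sample, top_n=2, max_num_segments=5,
                                  min_score=0.5, max_per_host=1):
        print(segment)
```

With the sample data, fewer segments come back than the maximum allowed, which matches the note on -maxNumSegments in the quoted text.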
