[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846174#action_12846174 ]
Andrzej Bialecki commented on NUTCH-762:
-----------------------------------------

For users who generate just one segment at a time this is an unexpected loss of flexibility. You can't run this version of the Generator twice without first completing _both_ fetching & updating of all segments from the previous run, because some of the same URLs would be generated again in the next round. The point of generate.update.crawldb is to be able to freely interleave generate/update steps. E.g. the following scenario breaks in a non-obvious way (a sketch follows the quoted description below):

* generate 10 segments
* fetch & update 8 of them
* realize you need more rounds, e.g. due to gone pages
* generate an additional 10 segments

..kaboom! The new segments now partially overlap with the 2 unfetched segments from the previous generation, and you are going to fetch some URLs twice.

> Alternative Generator which can generate several segments in one parse of the crawlDB
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-762
>                 URL: https://issues.apache.org/jira/browse/NUTCH-762
>             Project: Nutch
>          Issue Type: New Feature
>          Components: generator
>    Affects Versions: 1.0.0
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB, then updating the DB only once for several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well, as we need to read the whole crawlDB as many times as we generate segments.
> The attached patch contains an implementation of a MultiGenerator which can generate several fetchlists while reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects:
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDB is too slow to be usable on a large scale
> * can cap the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for capping the URLs and for partitioning; however, as we can't cap the number of URLs by IP, another unit must be chosen when partitioning by IP.
> We found that using a filter on the score can dramatically improve performance, as this reduces the amount of data sent to the reducers.
> The MultiGenerator is called via: nutch org.apache.nutch.crawl.MultiGenerator ...
> with the following options:
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator's, apart from:
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated can be less than the max value if, e.g., not enough URLs are available for fetching and they fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
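As a rough illustration of the scenario Andrzej describes above, here is a shell sketch of the interleaved workflow; the crawldb and segments paths, the loop over segment directories, and the numeric values are only placeholders, while fetch and updatedb are the standard bin/nutch sub-commands:

    # round 1: one pass over the crawlDB produces 10 segments
    bin/nutch org.apache.nutch.crawl.MultiGenerator crawl/crawldb crawl/segments -maxNumSegments 10

    # fetch and update only 8 of the 10 segments
    for segment in $(ls -d crawl/segments/* | sort | head -8); do
      bin/nutch fetch "$segment"
      bin/nutch updatedb crawl/crawldb "$segment"
    done

    # round 2: generating again at this point re-selects some of the URLs
    # still sitting in the 2 unfetched segments, so they get fetched twice
    bin/nutch org.apache.nutch.crawl.MultiGenerator crawl/crawldb crawl/segments -maxNumSegments 10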
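For reference, a hypothetical MultiGenerator invocation following the usage given in the quoted description above; all paths and numbers are placeholders, and only the options listed there are used:

    bin/nutch org.apache.nutch.crawl.MultiGenerator crawl/crawldb crawl/segments \
      -topN 2500000 -numFetchers 20 -maxNumSegments 10 -noFilter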
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.