Markus Jelsma created NUTCH-2368:
------------------------------------

             Summary: Variable generate.max.count and fetcher.server.delay
                 Key: NUTCH-2368
                 URL: https://issues.apache.org/jira/browse/NUTCH-2368
             Project: Nutch
          Issue Type: Improvement
          Components: generator
    Affects Versions: 1.12
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.13
         Attachments: NUTCH-2368.patch

In some cases we need to use host-specific characteristics to determine crawl 
speed and bulk sizes, because with our (Openindex) settings we can only recrawl 
hosts with up to 800k URLs.

This patch solves the problem by introducing the HostDB to the Generator and 
adding support for Jexl expressions. For example, these two expressions can be 
passed to the Generator:

{code}
-Dgenerate.max.count.expr='
if (unfetched + fetched > 800000) {
  return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1);
} else {
  return conf.getDouble("generate.max.count", 300);
}'

-Dgenerate.fetch.delay.expr='
if (unfetched + fetched > 800000) {
  return (pct95._rs_ + 500) * 1000;
} else {
  return conf.getDouble("fetcher.server.delay", 1000);
}'
{code}

For each large host, select as many records as can be fetched within the fetch 
time limit, given the number of threads and the 95th percentile response time. 
Or: queueMaxCount = (timelimit / responsetime) * numThreads.
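
As a rough illustration of that formula, here is a minimal Java sketch of the 
same arithmetic. The class and method names are hypothetical and not part of 
the patch, the input values are made up, and it assumes the response time is 
in milliseconds and that double arithmetic is used throughout (the Jexl 
expression may round differently depending on operand types).

```java
public class QueueSizing {
    /**
     * Hypothetical helper mirroring the first expression:
     * queueMaxCount = (timelimit / responsetime) * numThreads.
     * Assumes pct95ResponseMs is in milliseconds; the 500 ms margin
     * is taken from the expression in this issue.
     */
    static long maxCount(int timeLimitMins, double pct95ResponseMs, int threadsPerQueue) {
        double timeLimitSecs = timeLimitMins * 60.0;
        double secsPerFetch = (pct95ResponseMs + 500.0) / 1000.0;
        return (long) Math.floor(timeLimitSecs / secsPerFetch * threadsPerQueue);
    }

    public static void main(String[] args) {
        // Illustrative values: 12 minute time limit, 1500 ms response time, 1 thread.
        System.out.println(maxCount(12, 1500.0, 1));
    }
}
```

With these example values each fetch is budgeted 2 seconds, so a 720 second 
time limit allows 360 records on a single thread, and twice that with two 
threads per queue.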

The second expression follows from the first: it sets the crawlDelay of the 
fetch queue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)