[
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Talat UYARER updated NUTCH-1630:
--------------------------------
Issue Type: Improvement (was: Bug)
> How to achieve finishing fetch approximately at the same time for each queue
> (a.k.a adaptive queue size)
> ---------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-1630
> URL: https://issues.apache.org/jira/browse/NUTCH-1630
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.1, 2.2, 2.2.1
> Reporter: Talat UYARER
> Labels: improvement
> Fix For: 2.3
>
>
> Problem Definition:
> When crawling, queue sizes are often disproportionate, so the fetcher has to
> wait a long time for long-running queues after the shorter ones have
> finished. That means you may have to wait a couple of days for some queues.
> Normally we define the maximum queue size with generate.max.count, but that
> is a static value, while the number of URLs to be fetched increases with each
> depth. Giving all queues the same length does not mean they will all finish
> around the same time. This problem has been raised by other users
> before [1], so we came up with a different approach to the issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, byIP). Our
> solution is applicable to all three modes.
> 1- Define a "fetch workload of current queue" (FW) value for each queue, based
> on the previous fetches of that queue. We calculate this by:
> FW = average response time of previous depth * number of URLs in current
> queue
> 2- Calculate the harmonic mean [2] of all FWs to get the average workload of
> the current depth (AW).
> 3- Get the length for each queue by dividing AW by the previously known
> average response time of that queue:
> Queue Length = AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish at around
> the same time.
> I will send my patch as soon as possible. Do you have any comments?
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion, the harmonic mean is the best fit for our case because
> our data has a few points that are much higher than the rest.
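
The three steps in the description above can be sketched as follows. This is only an illustration of the proposed arithmetic, not the patch itself: the function name, the host names, the rounding to whole URLs, and the cap at the number of URLs actually available are all assumptions that do not appear in the issue.

```python
from statistics import harmonic_mean


def adaptive_queue_lengths(queues):
    """Return a per-queue length following the FW / AW steps of the proposal.

    `queues` maps a queue id (host, domain or IP) to a tuple of
    (average response time in seconds at the previous depth,
     number of URLs waiting in the current queue).
    """
    # Step 1: fetch workload (FW) of each queue.
    workloads = [rt * count for rt, count in queues.values()]

    # Step 2: average workload (AW) is the harmonic mean of all FWs.
    aw = harmonic_mean(workloads)

    # Step 3: queue length = AW / average response time of that queue.
    # Assumption (not in the issue): round to a whole number of URLs and
    # never ask for more URLs than the queue actually holds.
    return {
        name: min(count, max(1, round(aw / rt)))
        for name, (rt, count) in queues.items()
    }


# Hypothetical queues: (avg response time in seconds, URLs in queue).
queues = {
    "fast.example": (0.5, 1000),
    "slow.example": (2.0, 1000),
    "small.example": (1.0, 100),
}
lengths = adaptive_queue_lengths(queues)
for name, length in lengths.items():
    print(name, length, "URLs, about", length * queues[name][0], "seconds")
```

With these made-up numbers the FWs are 500, 2000 and 100 and their harmonic mean is 240, so the two large queues are both sized to take roughly 240 seconds, while the small queue simply fetches everything it has.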