On Thu, Dec 11, 2008 at 2:34 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Todd Lipcon wrote:
>
>> I tried turning the max.per.host configuration down to a few thousand,
>> which helped distribute the crawl activity over the length of the job,
>> but I was still left with a lot of "straggler" hosts for the last few
>> hours - mostly hosts that had set their crawl delay much higher.
>>
>
> Did you try to decrease the max crawl delay? Fetcher can drop items with
> too high a crawl delay, which could help.
>

The issue I see with decreasing the max crawl delay is that it essentially
blacklists those hosts. Even if I can only crawl these hosts 1/10th as fast,
I'd still like to have them in my index. I guess this is where the hostdb
will help once that JIRA is implemented, so this kind of thing can be taken
into account at generate time.
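
For concreteness, here's the kind of generate-time logic I have in mind. This
is a rough sketch only: it assumes a hostdb that can report each host's crawl
delay, and the method and parameter names are all made up.

  // Hypothetical generate-time per-host cap; no such hostdb API exists yet.
  // Never select more URLs for a host than it can serve within the target
  // fetch window at its advertised crawl delay.
  int maxPerHost(long crawlDelayMs, long targetFetchTimeMs, int configuredMax) {
    int affordable = (int) (targetFetchTimeMs / Math.max(1, crawlDelayMs));
    return Math.min(configuredMax, affordable);
  }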

>
>> I see the main issues as:
>>
>> 1) The queueing mechanism in Fetcher2 is a bit clunky since you can't
>> configure (afaik) the depth of the queue. Even at the start of the job I
>> have the vast majority of fetcher threads spinning even though there are
>> thousands of unique hosts left to fetch from.
>>
>> 2) There is no good way of dealing with disparity in crawl delay between
>> hosts. If there are 1000 URLs from host A and host B in the crawl, but
>> host B has a 30-second crawl delay, the minimum job time is going to be
>> 30,000 seconds.
>>
>>
>> I'm working on #1 in my new Fetcher implementation (for which there's a
>> JIRA) and I'd propose solving #2 like this:
>>
>> - Add a new configuration:
>>  key: fetcher.abandon.queue.percentage
>>  default: 0
>>  description: once the number of unique hosts left in the fetch queue
>> drops below this percentage of the number of fetcher threads divided by
>> the number of fetchers per host, abandon the remaining items by putting
>> them in "retry 0" state
>>
>
> With this strategy you need to add a mechanism to prevent Fetcher from
> dropping the items even if the absolute remaining time is still acceptable
> ... think about short queues and small fetchlists.
>

Yep - I imagine there would have to be another configurable parameter for
the amount of time to allow in the "too slow" state before actually
terminating the job.
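
Something like the following is what I'm picturing. A minimal sketch only:
the grace-period key and all class/field names are made up, and only
fetcher.threads.fetch and fetcher.threads.per.host are existing Nutch
settings.

  // Sketch of the proposed straggler check; not actual Nutch code.
  public class AbandonPolicy {
    private final float abandonPct;   // fetcher.abandon.queue.percentage
    private final long graceMs;       // hypothetical "too slow" grace period
    private final int threads, perHost;
    private long slowSince = -1;      // when we first looked "too slow"

    public AbandonPolicy(org.apache.hadoop.conf.Configuration conf) {
      abandonPct = conf.getFloat("fetcher.abandon.queue.percentage", 0f);
      graceMs = conf.getLong("fetcher.abandon.grace.ms", 5 * 60 * 1000L); // made up
      threads = conf.getInt("fetcher.threads.fetch", 10);
      perHost = conf.getInt("fetcher.threads.per.host", 1);
    }

    // True once too few unique hosts remain, and that condition has
    // persisted for the whole grace period.
    public boolean shouldAbandon(int uniqueHostsLeft, long now) {
      boolean tooSlow = uniqueHostsLeft < (abandonPct / 100f) * threads / perHost;
      if (!tooSlow) { slowSince = -1; return false; }
      if (slowSince < 0) slowSince = now;
      return (now - slowSince) >= graceMs;
    }
  }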


>
>> and (to help with both 1 and 2):
>>  key: fetcher.queue.lookahead.ratio
>>  default: 50 (from Fetcher2.java:875)
>>  description: the maximum number of fetch items that will be queued
>> ahead of where the fetcher is currently working. By setting this value
>> high, the fetcher can do a better job of keeping all of its fetch
>> threads busy at the expense of more RAM. Estimated RAM usage is
>> sizeof(CrawlDatum) * numfetchers * lookaheadratio * some small overhead.
>>
>
> The current implementation (in Fetcher2) tries to do just that - see the
> QueueFeeder thread. Or did you have something else in mind?
>

Sorry - I just had in mind making that factor of 50 a tunable parameter for
those who have more RAM available.
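
Concretely, it would just be a one-line change where Fetcher2 constructs the
QueueFeeder. A sketch, assuming the key name proposed above:

  // Replace the hard-coded factor of 50 with a configurable ratio.
  int lookahead = getConf().getInt("fetcher.queue.lookahead.ratio", 50);
  int threadCount = getConf().getInt("fetcher.threads.fetch", 10);
  // The feeder keeps at most threadCount * lookahead items queued ahead
  // of the fetch threads (previously threadCount * 50).
  feeder = new QueueFeeder(input, fetchQueues, threadCount * lookahead);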


>
>>  My guess is that the size of a CrawlDatum is at worst 1KB, so with
>> typical values of fetch thread count <100, the default ratio of 50
>> allocates only 5MB of RAM for the fetch queue. I'd personally rather
>> tune this up 20x or so, since Fetcher is otherwise not that RAM
>> intensive.
>>
>> Any thoughts on this? My end goal: given a segment with topN=1M and
>> max.per.host=10K, 100 fetcher threads, and crawl.delay = 1sec, I'd
>> expect the total fetch time (ignoring distribution) to be at most in the
>> range of 10,000 seconds, with expected time much lower since most hosts
>> won't hit the max.per.host limit.
>>
>
> Crawl delay is not the only factor that slows down the fetching. One
> common issue is hosts that accept connections relatively quickly, but
> then trickle the data very slowly, byte by byte. Since we have per-host
> queues, it would make sense to monitor not only the req/sec rate but
> also the bytes/sec rate, and if it falls too low, terminate the queue.
>

Yep - though, similar to the issue with max crawl delay, I don't want to
terminate the queue. Rather, I just want the queue to be marked as
"terminable" such that it won't hold up map task completion when it's the
only queue left. If there are other queues still making reasonable progress,
we should keep trickling those bytes from the slow host.
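
As a rough sketch of what I mean - the terminable flag and the rate tracking
would be new additions to Fetcher2's FetchItemQueue, and all of the names
below are made up:

  // Per-queue throughput monitor that marks, but never kills, slow queues.
  class FetchItemQueue {
    volatile boolean terminable;  // safe to abandon if it's all that's left
    private long bytes;           // bytes fetched in the current window
    private long windowStart = System.currentTimeMillis();

    synchronized void recordBytes(long n) { bytes += n; }

    // Called periodically: flag the queue if its sustained bytes/sec falls
    // below the floor, then reset the measurement window.
    synchronized void checkRate(long now, long minBytesPerSec) {
      long secs = Math.max(1, (now - windowStart) / 1000);
      terminable = (bytes / secs) < minBytesPerSec;
      bytes = 0;
      windowStart = now;
    }
  }

The fetcher's main loop would then only bail out early once every remaining
non-empty queue is terminable, putting those leftover items into the "retry
0" state described above.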

-Todd
