[EMAIL PROTECTED] wrote:

Host extraction from URL makes sense, but there would be no host-level
data in CrawlDatum.  For example, one of the things I'd like to track is
download speed.  I don't want to track that on the per-URL level, but on
a per-host level.  I'd keep track of the d/l speed for each host in Fetcher2
and its FetcherInputQueue (that part is in JIRA already).
So I'm not sure how I'd put the d/l speed for a host in the CrawlDatum....

You really don't have to - see below. The queue monitoring stuff in Fetcher
gives you only the metrics for the current fetchlist anyway, so they are
incomplete - you need to calculate the actual averages over all urls from
that host, not just those in the current fetchlist. That's why it's better to
do this using the information from CrawlDb and not just from the current
segment.

So, let's assume for a moment that you don't track the d/l speed per host in
Fetchers, or that you discard this information, and that you only add the
actually measured per-url download speed to crawl_fetch, as part of
CrawlDatum.metaData. This metadata will be merged into the CrawlDb during the
updatedb operation (replacing any older values if they exist).
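
As a rough sketch of that step (not actual Fetcher code - the key name
"_dlspeed_", the units, and the helper class are made up for illustration),
attaching the measured speed to the per-url CrawlDatum could look roughly
like this, assuming CrawlDatum.metaData is the usual map of Writable
key/value pairs:

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DlSpeedMetadata {
  // Illustrative key name, not an existing Nutch convention.
  public static final Text DL_SPEED_KEY = new Text("_dlspeed_");

  /** Attach the measured per-url speed (KB/s) to the datum for crawl_fetch. */
  public static void recordSpeed(CrawlDatum datum, float kbPerSec) {
    datum.getMetaData().put(DL_SPEED_KEY, new FloatWritable(kbPerSec));
  }
}

The Fetcher would call recordSpeed() on the datum it writes to crawl_fetch;
updatedb then carries that entry into CrawlDb as usual.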



Map:    input:  <url, CrawlDatum>
        output: <host, hostStats>  // per-url stats from CrawlDatum.metaData,
                                   // keyed by the host extracted from the url
Reduce: input:  <host, (hostStats1, hostStats2, ...)>
        output: <host, hostStats>  // aggregated
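
To make that concrete, a minimal sketch of such a HostDb job (hypothetical
class names; it assumes the per-url speed sits under the illustrative
"_dlspeed_" metadata key from the sketch above, and uses the old mapred API
that Nutch jobs were written against) might look like:

import java.io.IOException;
import java.net.URL;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

/** Hypothetical HostDb job: aggregates per-url CrawlDb metrics by host. */
public class HostDbSketch {

  /** Map: <url, CrawlDatum> -> <host, dl_speed for that url>. */
  public static class HostMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, FloatWritable> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      Writable speed = datum.getMetaData().get(new Text("_dlspeed_"));
      if (speed instanceof FloatWritable) {
        String host = new URL(url.toString()).getHost();
        output.collect(new Text(host), (FloatWritable) speed);
      }
    }
  }

  /** Reduce: <host, (speed1, speed2, ...)> -> <host, average speed>. */
  public static class HostReducer extends MapReduceBase
      implements Reducer<Text, FloatWritable, Text, FloatWritable> {
    public void reduce(Text host, Iterator<FloatWritable> speeds,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      float sum = 0.0f;
      int count = 0;
      while (speeds.hasNext()) {
        sum += speeds.next().get();
        count++;
      }
      if (count > 0) output.collect(host, new FloatWritable(sum / count));
    }
  }
}

A real job would of course emit a richer HostStats writable (requests,
timeouts, bytes, etc.) instead of a single float, so the reduce side can
produce exactly the kind of per-host record you describe below.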


Let's try with a concrete example.
Imagine I just ran a fetch job that fetched some number of URLs from
www.foo.com and www.bar.com.  The aggregate d/l speed for foo.com during
that fetch run was 50 kbps; for bar.com it was 20 kbps.

At the end of the run, I'd somehow store, say:
www.foo.com   dl_speed:50   requests:100   timeouts:0
www.bar.com   dl_speed:20   requests:90    timeouts:20

No, what you want to store is this:

www.example.com/page1.html dl_speed:50 status:ok
www.example.com/page2.html dl_speed:45 status:ok
www.example.com/page3.html dl_speed:0 status:gone
...


Then, I was thinking, something else (some HostDb MapReduce job)
would go through this data stored under segment/2008...../something/
and merge it into crawl/hostdb file.

It sounds like you are saying this should go into CrawlDatum and be
merged into crawl/crawldb.... but I don't see how I'd put the numbers
from the example above into CrawlDatum without repeating them, i.e.
without every URL from www.foo.com carrying the same three www.foo.com
numbers in its crawldb entry.

See above - we store only per-url metrics in CrawlDb. Then the HostDb job
aggregates the info from CrawlDb using the host name (or domain name, or
TLD) as the key.
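
For the key itself, plain java.net.URL is enough at the host level; a
domain- or TLD-level key needs more care (the helper below is a naive
stand-in for illustration, not an existing Nutch utility):

import java.net.MalformedURLException;
import java.net.URL;

/** Sketch of deriving the aggregation key; the domain logic is naive. */
public class HostKey {

  /** Host-level key, e.g. "www.example.com". */
  public static String hostKey(String url) throws MalformedURLException {
    return new URL(url).getHost().toLowerCase();
  }

  /**
   * Naive domain-level key, e.g. "example.com"; real code would need a
   * public-suffix aware helper instead of just taking the last two labels.
   */
  public static String domainKey(String url) throws MalformedURLException {
    String host = hostKey(url);
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }
}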

PS. Could you please wrap your lines to 80 chars? I always have to
re-wrap your emails when responding, otherwise they consist of very,
very long lines...


Sorry about that.  I wrapped them manually here.

Thanks. Mail apps are no longer what they used to be ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
