[EMAIL PROTECTED] wrote:

Host extraction from URL makes sense, but there would be no host-level
data in CrawlDatum.  For example, one of the things I'd like to track is
download speed.  I don't want to track that on the per-URL level, but on
a per-host level.  I'd keep track of the d/l speed for each host in Fetcher2
and its FetcherInputQueue (that part is in JIRA already).
So I'm not sure how I'd put the d/l speed for a host in the CrawlDatum....

You really don't have to - see below. The queue monitoring stuff in Fetcher
gives you only the metrics for the current fetchlist anyway, so they are
incomplete - you need to calculate the actual averages over all urls from
that host, not just those in the current fetchlist. That's why it's better to
do this using the information from CrawlDb and not just from the current
segment.

So, let's assume for a moment that you don't track the d/l speed per host in
Fetchers, or that you discard this information, and that you only add the
actually measured per-url download speed to crawl_fetch, as part of
CrawlDatum.metaData. This metadata will be merged into the CrawlDb during the
updatedb operation (replacing any older values if they exist).
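
As a rough sketch of that step (not actual Fetcher code - the key name
"_dlspeed_", the units, and the helper class are made up for illustration),
attaching the measured speed to the per-url CrawlDatum could look roughly
like this, assuming CrawlDatum.metaData is the usual map of Writable
key/value pairs:

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DlSpeedMetadata {
  // Illustrative key name, not an existing Nutch convention.
  public static final Text DL_SPEED_KEY = new Text("_dlspeed_");

  /** Attach the measured per-url speed (KB/s) to the datum for crawl_fetch. */
  public static void recordSpeed(CrawlDatum datum, float kbPerSec) {
    datum.getMetaData().put(DL_SPEED_KEY, new FloatWritable(kbPerSec));
  }
}

The Fetcher would call recordSpeed() on the datum it writes to crawl_fetch;
updatedb then carries that entry into CrawlDb as usual.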



Map:    input:  <url, CrawlDatum>
        output: <host, hostStats>  // per-url stats from CrawlDatum.metaData,
                                   // keyed by the host extracted from the url
Reduce: input:  <host, (hostStats1, hostStats2, ...)>
        output: <host, hostStats>  // aggregated
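
To make that concrete, a minimal sketch of such a HostDb job (hypothetical
class names; it assumes the per-url speed sits under the illustrative
"_dlspeed_" metadata key from the sketch above, and uses the old mapred API
that Nutch jobs were written against) might look like:

import java.io.IOException;
import java.net.URL;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

/** Hypothetical HostDb job: aggregates per-url CrawlDb metrics by host. */
public class HostDbSketch {

  /** Map: <url, CrawlDatum> -> <host, dl_speed for that url>. */
  public static class HostMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, FloatWritable> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      Writable speed = datum.getMetaData().get(new Text("_dlspeed_"));
      if (speed instanceof FloatWritable) {
        String host = new URL(url.toString()).getHost();
        output.collect(new Text(host), (FloatWritable) speed);
      }
    }
  }

  /** Reduce: <host, (speed1, speed2, ...)> -> <host, average speed>. */
  public static class HostReducer extends MapReduceBase
      implements Reducer<Text, FloatWritable, Text, FloatWritable> {
    public void reduce(Text host, Iterator<FloatWritable> speeds,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      float sum = 0.0f;
      int count = 0;
      while (speeds.hasNext()) {
        sum += speeds.next().get();
        count++;
      }
      if (count > 0) output.collect(host, new FloatWritable(sum / count));
    }
  }
}

A real job would of course emit a richer HostStats writable (requests,
timeouts, bytes, etc.) instead of a single float, so the reduce side can
produce exactly the kind of per-host record you describe below.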


Let's try with a concrete example.
Imagine I just ran a fetch job that fetched some number of URLs from
www.foo.com and www.bar.com.  The aggregate d/l speed for foo.com during
that fetch run was 50 kbps; for bar.com it was 20 kbps.

At the end of the run, I'd somehow store, say:
www.foo.com   dl_speed:50   requests:100   timeouts:0
www.bar.com   dl_speed:20   requests:90    timeouts:20

No, what you want to store is this:

www.example.com/page1.html dl_speed:50 status:ok
www.example.com/page2.html dl_speed:45 status:ok
www.example.com/page3.html dl_speed:0 status:gone
...


Then, I was thinking, something else (some HostDb MapReduce job)
would go through this data stored under segment/2008...../something/
and merge it into crawl/hostdb file.

It sounds like you are saying this should go into CrawlDatum and be
merged into crawl/crawldb.... but I don't see how I'd put the numbers
from the example above into CrawlDatum without repeating them, i.e.
without every URL from www.foo.com carrying the same three www.foo.com
numbers in its crawldb entry.

See above - we store only per-url metrics in CrawlDb. Then the HostDb job
aggregates the info from CrawlDb using the host name (or domain name, or
TLD) as the key.
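
For the key itself, plain java.net.URL is enough at the host level; a
domain- or TLD-level key needs more care (the helper below is a naive
stand-in for illustration, not an existing Nutch utility):

import java.net.MalformedURLException;
import java.net.URL;

/** Sketch of deriving the aggregation key; the domain logic is naive. */
public class HostKey {

  /** Host-level key, e.g. "www.example.com". */
  public static String hostKey(String url) throws MalformedURLException {
    return new URL(url).getHost().toLowerCase();
  }

  /**
   * Naive domain-level key, e.g. "example.com"; real code would need a
   * public-suffix aware helper instead of just taking the last two labels.
   */
  public static String domainKey(String url) throws MalformedURLException {
    String host = hostKey(url);
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }
}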

PS. Could you please wrap your lines to 80 chars? I always have to
re-wrap your emails when responding, otherwise they consist of very,
very long lines...


Sorry about that.  I wrapped them manually here.

Thanks. Mail apps are no longer what they used to be ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
