Sandeep Tata wrote:
Hi,

I was wondering what would be the best way to configure per-host
re-crawl intervals. The default db.fetch.interval applies to all URLs,
but I'd like for some hosts to be recrawled more frequently. Is there
a JIRA ticket open on this? I haven't been able to find one.

Technically, the fetch interval can be set on individual CrawlDatum-s in crawldb. In practice, there is no command-line tool to do this, and I don't think there is a JIRA issue for it.

One idea would be to modify the Injector to accept a list of URL-s with matching metadata, including predefined metadata keys such as fetchInterval. On the initial injection, all values in CrawlDatum would be set according to the metadata (or to defaults where none is given). On subsequent injections, if a URL already exists in CrawlDb, its metadata would be reset to the values supplied in the injector file.

This should be easy to implement, and I think it would support your use case.
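
The injection logic described above can be sketched roughly as follows. This is only an illustration in Python, not Nutch code: the seed-line format (a URL followed by tab-separated key=value pairs), the use of a plain dict in place of CrawlDb, the helper names, and the default value of 30 are all assumptions made for the example.

```python
# Illustrative sketch only, not Nutch code. The seed-line format, the plain
# dict standing in for CrawlDb, and the default below are assumptions.

DEFAULT_FETCH_INTERVAL = 30  # placeholder for the configured default interval


def parse_seed_line(line):
    """Split 'url<TAB>key=value<TAB>key=value' into (url, metadata dict)."""
    url, *pairs = line.strip().split("\t")
    metadata = dict(pair.split("=", 1) for pair in pairs)
    return url, metadata


def inject(crawldb, seed_lines):
    """Add new URLs with their metadata; reset metadata of existing URLs."""
    for line in seed_lines:
        if not line.strip():
            continue  # skip blank lines in the seed file
        url, metadata = parse_seed_line(line)
        # New URLs start from the default; existing entries are kept as-is.
        entry = crawldb.setdefault(url, {"fetchInterval": DEFAULT_FETCH_INTERVAL})
        # Values supplied in the seed file override whatever is stored.
        if "fetchInterval" in metadata:
            entry["fetchInterval"] = int(metadata.pop("fetchInterval"))
        entry.update(metadata)
    return crawldb
```

With this shape, a first call such as inject({}, ["http://a.example/\tfetchInterval=1", "http://b.example/"]) gives the first URL its own interval and the second the default, and a later injection that lists an existing URL with a new fetchInterval overwrites the stored value while leaving other URLs alone.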

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
