Sandeep Tata wrote:
Hi,
I was wondering what would be the best way to configure per-host
re-crawl intervals. The default db.fetch.interval applies to all URLs,
but I'd like for some hosts to be recrawled more frequently. Is there
a JIRA ticket open on this? I haven't been able to find one.
Fetch interval can be set on individual CrawlDatum-s in crawldb, at
least technically speaking. In practice, there is no command-line tool
to do this, and I don't think there is a JIRA on this.
One idea would be to modify the Injector to accept a list of URL-s, each
with matching metadata, including predefined keys such as fetchInterval.
On the initial injection, all values in CrawlDatum would
be set according to the metadata (or set to defaults). On subsequent
injections, if a URL already exists in CrawlDb, its metadata would be
reset to the values supplied in the injector file.
This should be easy to implement, and I think it would support your use
case.
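A minimal sketch of the idea above, in plain Java and independent of
Nutch's real classes. The seed-line format (a URL followed by
tab-separated key=value pairs), the Entry class and the inject method
are all assumptions made for illustration, as are the interval unit
(seconds) and the default value. It also assumes that re-injecting an
existing URL overwrites only the keys present in the seed line.

```java
import java.util.HashMap;
import java.util.Map;

public class SeedInjectSketch {
    // Hypothetical default standing in for db.fetch.interval (unit assumed: seconds).
    public static final int DEFAULT_FETCH_INTERVAL = 30 * 24 * 3600;

    // Simplified stand-in for a crawldb record; this is NOT Nutch's CrawlDatum.
    public static final class Entry {
        public int fetchInterval = DEFAULT_FETCH_INTERVAL;
        public final Map<String, String> metadata = new HashMap<>();
    }

    // Merges one seed line of the assumed form
    //   url<TAB>key=value<TAB>key=value...
    // into an in-memory map that plays the role of the crawldb.
    public static void inject(Map<String, Entry> crawlDb, String seedLine) {
        String[] fields = seedLine.trim().split("\t");
        String url = fields[0];
        if (url.isEmpty()) {
            return; // blank line, nothing to inject
        }
        // New URLs start from defaults; existing URLs keep their current record.
        Entry entry = crawlDb.computeIfAbsent(url, u -> new Entry());
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // skip fields that are not key=value
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            if (key.equals("fetchInterval")) {
                // Predefined key: sets the per-URL interval.
                // Throws NumberFormatException if the value is not an integer.
                entry.fetchInterval = Integer.parseInt(value);
            } else {
                entry.metadata.put(key, value);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Entry> crawlDb = new HashMap<>();
        inject(crawlDb, "http://example.com/\tfetchInterval=3600\tsection=news");
        inject(crawlDb, "http://example.org/");
        for (Map.Entry<String, Entry> e : crawlDb.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().fetchInterval);
        }
    }
}
```

Per-host intervals would then come down to generating seed lines with
the same fetchInterval value for every URL of a given host.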
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com