Karsten Dello wrote:
Hello,

I have used the intranet crawl for the following simple task:
Given a list of relevant start URLs,
get all documents within reach of two clicks.
We use this mechanism for monitoring a couple of dozen lists on the internet.
This was easy using the "-depth" parameter of the crawl tool.
As the number of documents was pretty small, we simply recreated the index from scratch every two weeks.
Now the number of documents has grown,
which is why I would like to implement incremental updates.

I played around with the "whole-web" mechanism, but I could not see how to incrementally update an index while keeping the condition "max hops from a start URL <= 2" true for all documents in the index.
I would really appreciate some advice on that.

In your case I think you have to store the hop count in the CrawlDatum metadata, and then implement a scoring plugin that does something like this (all other methods can be pass-through):

  public float generatorSortValue(UTF8 url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    IntWritable hopCount = (IntWritable) datum.getMetaData().get(new UTF8("hopCount"));
    if (hopCount == null) return initSort;
    if (hopCount.get() >= 2) return 0;
    return initSort;
  }

This means that URLs with hopCount >= 2 won't ever be selected for fetching.

Of course you need to pass this value around and increment it appropriately when you insert newly discovered URLs, but the ScoringFilter API should be sufficient for this - you can always pass the values around in some metadata: if not in CrawlDatum, then in Content or ParseData.
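
For example, a minimal sketch of the propagation step could look like the one below. The "hopCount" key matches the one used above, but the helper and its name are just an illustration - which ScoringFilter method it would be called from (and that method's exact signature) depends on your Nutch version:

  // Hypothetical helper: copy the parent's hop count, incremented by one,
  // onto a newly discovered outlink's CrawlDatum.
  private static final UTF8 HOP_COUNT_KEY = new UTF8("hopCount");

  public static void propagateHopCount(CrawlDatum parent, CrawlDatum outlink) {
    IntWritable parentHops =
        (IntWritable) parent.getMetaData().get(HOP_COUNT_KEY);
    int hops = (parentHops == null) ? 0 : parentHops.get();
    outlink.getMetaData().put(HOP_COUNT_KEY, new IntWritable(hops + 1));
  }

Injected start URLs can simply be left without a hopCount entry (or given an explicit 0), so the generatorSortValue above will always let them through.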

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



