Hello,
I have been using the intranet crawl for the following simple task:
given a list of relevant start URLs,
fetch all documents within two clicks of them.
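
To make the requirement concrete, here is a minimal Python sketch of what
"within two clicks" means, written as a breadth-first walk over a link
graph. This is not Nutch code: the `links` dictionary, the function name
`within_hops`, and the example.org URLs are all made up for illustration.

```python
from collections import deque

def within_hops(start_urls, links, max_hops=2):
    """Return {url: hop count} for every URL reachable in at most max_hops clicks.

    `links` is a hypothetical in-memory link graph: {url: [outlink, ...]}.
    """
    hops = {url: 0 for url in start_urls}
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:  # first visit is the shortest path in BFS
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

# Hypothetical link graph: list -> a, b; a -> c; c -> d
links = {
    "http://example.org/list": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/c"],
    "http://example.org/c": ["http://example.org/d"],
}
reach = within_hops(["http://example.org/list"], links)
```

Here `http://example.org/d` is three clicks away from the start URL, so it
is not part of `reach`.
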
We use this mechanism for monitoring a couple of dozen lists on the
internet.
This was easy using the "-depth" parameter of the crawl tool.
As the number of documents was pretty small, we just recreated that
index from scratch every two weeks.
Now the number of documents has grown,
so I would like to implement incremental updates.
I played around with the "whole-web" mechanism, but I could not see how
to incrementally update an index while keeping the condition "max
hops from a start URL <= 2" true for all documents in the index.
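
What I have in mind could be sketched as follows, in plain Python rather
than Nutch. It assumes the outlinks of every fetched page are stored, so
that hop counts can be recomputed cheaply after each refetch without
downloading anything again. The names `hop_counts` and `apply_refetch`,
the `index` set, and the URLs are hypothetical, and whether the Nutch
tools can be combined to do this is exactly what I do not know.

```python
from collections import deque

def hop_counts(start_urls, links, max_hops=2):
    """Breadth-first hop counts over the stored link graph, capped at max_hops."""
    hops = {url: 0 for url in start_urls}
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue
        for target in links.get(url, ()):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

def apply_refetch(index, links, start_urls, url, new_outlinks, max_hops=2):
    """Record the fresh outlinks of one refetched page, then restore the invariant.

    Returns (to_fetch, dropped): URLs that are now within reach but not yet
    indexed, and URLs removed from the index because they fell out of reach.
    """
    links[url] = list(new_outlinks)
    hops = hop_counts(start_urls, links, max_hops)
    dropped = sorted(u for u in index if u not in hops)
    for u in dropped:
        index.discard(u)
        links.pop(u, None)  # forget outlinks of pages no longer indexed
    to_fetch = sorted(u for u in hops if u not in index)
    return to_fetch, dropped

# Hypothetical state after an earlier full crawl: s -> a, b; a -> c
start = ["http://example.org/s"]
links = {
    "http://example.org/s": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/c"],
}
index = set(hop_counts(start, links))

# The start page is refetched and now links to b and e instead of a and b.
to_fetch, dropped = apply_refetch(
    index, links, start,
    "http://example.org/s",
    ["http://example.org/b", "http://example.org/e"],
)
```

In this example `e` has to be fetched, while `a` and `c` are dropped
because no path of at most two clicks leads to them any more.
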
I would really appreciate some advice on that.
Thanks a lot in advance,
Best regards
Karsten
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general