Christoph,
If I understand your question correctly, you could solve this in a
rather crude fashion by running two separately-configured Nutch
instances:
1) "Discovery crawler" to crawl the seed domains to a given depth
- Create a custom URLFilter that simply "captures" a discovered
URL: log it off to the side, and accept it (return the URL unchanged)
- Adjust nutch-site.xml to place this at the head of the URL filter
chain
- Don't index the pages fetched by this crawler
2) "Content crawler" to crawl the indexable pages
- Just inject all of the URLs fetched by crawler (1)
- Limit depth=0 if you intend to refresh your content periodically
- Index these pages
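A rough sketch of the "capture" filter from step (1), plus writing the
captured URLs out as a seed list for step (2). This is not tested against
Nutch itself: SimpleUrlFilter is a local stand-in for Nutch's URLFilter
contract (return the URL to accept it, null to reject it), and the class
names, the seedHost parameter and the writeSeedFile helper are all made up
for illustration. Recording only URLs outside the seed domain is my
assumption about what the original question needs; drop the isExternal
check to capture everything.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Stand-in for the filter contract: return the URL to accept, null to reject. */
interface SimpleUrlFilter {
    String filter(String urlString);
}

/** Hypothetical capture filter: records external URLs and accepts every URL. */
class CapturingUrlFilter implements SimpleUrlFilter {
    private final String seedHost;
    // LinkedHashSet drops duplicates while keeping discovery order.
    private final Set<String> captured = new LinkedHashSet<>();

    CapturingUrlFilter(String seedHost) {
        this.seedHost = seedHost.toLowerCase(Locale.ROOT);
    }

    @Override
    public String filter(String urlString) {
        if (isExternal(urlString)) {
            captured.add(urlString); // "log it off to the side"
        }
        return urlString; // always accept, so the discovery crawl is unaffected
    }

    /** True if the URL's host is neither the seed host nor one of its subdomains. */
    boolean isExternal(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            if (host == null) {
                return false; // no host (relative or opaque URI): not treated as external
            }
            host = host.toLowerCase(Locale.ROOT);
            return !(host.equals(seedHost) || host.endsWith("." + seedHost));
        } catch (URISyntaxException e) {
            return false; // unparsable URLs are not captured
        }
    }

    List<String> getCaptured() {
        return new ArrayList<>(captured);
    }

    /** Writes one URL per line, suitable as a seed list for the content crawler. */
    void writeSeedFile(Path seedFile) throws IOException {
        Files.write(seedFile, captured, StandardCharsets.UTF_8);
    }
}
```

In a real plugin the class would implement org.apache.nutch.net.URLFilter
instead of the stand-in interface, and the seed file would go into the
directory you hand to the injector for crawler (2).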
Merging these two workflows into a single, easily-managed Nutch
instance is left as an exercise for the reader. ;)
HTH,
--Matt
On Nov 27, 2007, at 3:30 PM, Christoph M. Pflügler wrote:
Hello,
I'm currently working on a project to create a small search engine for
sites of different levels of trustworthiness.
Therefore I need a way to crawl a certain domain to a certain depth
(which can be done with urlfilter.txt, the seed domain(s) and the
depth), but only links to other (external) domains should be available
to the search engine afterwards (and therefore only these should be
indexed??).
If I add the original domain to the "exclude list" of urlfilter.txt,
of course nothing is going to be crawled/indexed.
What I basically want to achieve is to get all external sites whose
"distance" to the domain I'm crawling is "1 link". Right now my plan is
to crawl the domain once (first index) and then crawl the resulting
sites of this first crawl with this "distance" thing described above
(second index). Finally, I wanted to search over these two indexes, each
one representing a different "trust-level".
I must admit that I don't know much about IR, but I'm a fairly
good Java programmer. I googled a lot, but I wasn't able to find
anything useful. I also looked around in the Nutch API, but
classes like IndexingFilter don't seem to solve my problem.
I also thought about writing a plugin, but I don't really
know where to start.
So if somebody has an idea how I could solve this, please let me
know!!
Thanks for your help!!
Chris
--
Matt Kangas / [EMAIL PROTECTED]