Hello, I'm currently working on a project to build a small search engine for sites of different levels of trustworthiness.
For that I need a way to crawl a certain domain to a certain depth (which can be done with urlfilter.txt, the seed domain(s) and the depth), but afterwards only pages on other (external) domains should be available to the search engine, so only those should be indexed. If I simply add the original domain to the "exclude list" of urlfilter.txt, of course nothing gets crawled or indexed at all. What I basically want is to get all external pages whose "distance" to the domain I'm crawling is one link.

Right now my plan is to crawl the domain once (first index) and then crawl the pages found by that first crawl with the "distance of one link" restriction described above (second index). Finally, I want to search over these two indexes, each one representing a different trust level.

I must admit that I don't know much about IR, but I'm a fairly good Java programmer. I googled a lot, but I wasn't able to find anything useful. I was also looking around in the Nutch API, but classes like IndexingFilter don't seem to solve my problem on their own. I was also thinking about writing a plugin, but I don't really know where to start (I've put my rough idea in the P.S. below). So if somebody has an idea how I could solve this, please let me know!!

Thanks for your help!!
Chris
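
P.S. In case it helps to show what I mean, here is roughly how far I got. For the first crawl I would restrict fetching to the seed domain in conf/crawl-urlfilter.txt (or regex-urlfilter.txt, depending on how the crawl is started); example.com is just a placeholder:

# accept everything on the seed domain
+^http://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.

And this is my rough sketch of the plugin idea: an IndexingFilter that drops every page belonging to the seed domain at indexing time, so the domain still gets crawled (and its outlinks followed) but only the external pages end up in the index. I wrote it against the IndexingFilter interface as I understand it from the 1.x javadocs, so I'm not sure the method set matches every Nutch version exactly, and the property name "trustfilter.seed.host" is just something I made up to set in nutch-site.xml:

// Very rough sketch, assuming the Nutch 1.x IndexingFilter interface
// (NutchDocument-based); the exact methods seem to differ between versions.
package org.example.trustfilter; // made-up package name

import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Drops pages of the seed domain at indexing time, keeping only external pages. */
public class ExternalOnlyIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private String seedHost;   // e.g. "example.com"

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    try {
      String host = new URL(url.toString()).getHost();
      // Returning null should exclude the document from the index, while the
      // page itself was still crawled and its outlinks followed.
      if (host.equals(seedHost) || host.endsWith("." + seedHost)) {
        return null;
      }
    } catch (Exception e) {
      // if the URL cannot be parsed, just keep the document
    }
    return doc;
  }

  // required by some Nutch versions' IndexingFilter interface; does nothing here
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "trustfilter.seed.host" is a made-up property I would set in nutch-site.xml
    this.seedHost = conf.get("trustfilter.seed.host", "");
  }

  public Configuration getConf() {
    return conf;
  }
}

I guess the plugin would also need the usual plugin.xml registering it at the org.apache.nutch.indexer.IndexingFilter extension point and an entry in plugin.includes, but I haven't tried that yet. Does this go in the right direction, or is there a better way to do the "distance of one link" part?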
