Hello, I'm currently working on a project to build a small search engine for sites of different levels of trustworthiness.
For that I need a way to crawl a certain domain to a certain depth (which can be done with urlfilter.txt, the seed domain(s) and the depth), but afterwards only pages on other (external) domains should be available to the search engine, so only those should be indexed. If I simply add the original domain to the "exclude list" of urlfilter.txt, of course nothing gets crawled or indexed at all. What I basically want is to get all external pages whose "distance" to the domain I'm crawling is one link.

Right now my plan is to crawl the domain once (first index) and then crawl the pages found by that first crawl with the "distance of one link" restriction described above (second index). Finally, I want to search over these two indexes, each one representing a different trust level.

I must admit that I don't know much about IR, but I'm a fairly good Java programmer. I googled a lot, but I wasn't able to find anything useful. I was also looking around in the Nutch API, but classes like IndexingFilter don't seem to solve my problem on their own. I was also thinking about writing a plugin, but I don't really know where to start (I've put my rough idea in the P.S. below). So if somebody has an idea how I could solve this, please let me know!!

Thanks for your help!!
Chris
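
P.S. In case it helps to show what I mean, here is roughly how far I got. For the first crawl I would restrict fetching to the seed domain in conf/crawl-urlfilter.txt (or regex-urlfilter.txt, depending on how the crawl is started); example.com is just a placeholder:

# accept everything on the seed domain
+^http://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.

And this is my rough sketch of the plugin idea: an IndexingFilter that drops every page belonging to the seed domain at indexing time, so the domain still gets crawled (and its outlinks followed) but only the external pages end up in the index. I wrote it against the IndexingFilter interface as I understand it from the 1.x javadocs, so I'm not sure the method set matches every Nutch version exactly, and the property name "trustfilter.seed.host" is just something I made up to set in nutch-site.xml:

// Very rough sketch, assuming the Nutch 1.x IndexingFilter interface
// (NutchDocument-based); the exact methods seem to differ between versions.
package org.example.trustfilter; // made-up package name

import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Drops pages of the seed domain at indexing time, keeping only external pages. */
public class ExternalOnlyIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private String seedHost;   // e.g. "example.com"

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    try {
      String host = new URL(url.toString()).getHost();
      // Returning null should exclude the document from the index, while the
      // page itself was still crawled and its outlinks followed.
      if (host.equals(seedHost) || host.endsWith("." + seedHost)) {
        return null;
      }
    } catch (Exception e) {
      // if the URL cannot be parsed, just keep the document
    }
    return doc;
  }

  // required by some Nutch versions' IndexingFilter interface; does nothing here
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "trustfilter.seed.host" is a made-up property I would set in nutch-site.xml
    this.seedHost = conf.get("trustfilter.seed.host", "");
  }

  public Configuration getConf() {
    return conf;
  }
}

I guess the plugin would also need the usual plugin.xml registering it at the org.apache.nutch.indexer.IndexingFilter extension point and an entry in plugin.includes, but I haven't tried that yet. Does this go in the right direction, or is there a better way to do the "distance of one link" part?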
