Hello! In short:
Is it possible to tell Nutch to follow links through one larger namespace, but only index (add to its database) the content of pages that are in a sub-namespace of it?

The background: I have started to experiment with crawling my blog with Nutch. The problem is that this blog doesn't have its own domain. Instead, it is hosted on a larger site, which also hosts discussion forums and other people's blogs. My URL there is "http://www.geekzone.co.nz/foobar", so naturally I thought that adding something to the crawl-urlfilter.txt file would help. Something like this:

+^http://([a-z0-9]*\.)*geekzone.co.nz/foobar

But look at the bottom of that page: the navigation links to the other pages in my blog, or to the 'next' page, actually lead out of my namespace. Thus, they are no longer picked up, and Nutch never sees the additional links that I have on those other pages.

Since eventually I would like this to be a bit more generic (I don't want anything specific to my blog, that's just a test case), I thought that maybe I have to open it up to the root URL, making the filter something like this:

+^http://([a-z0-9]*\.)*geekzone.co.nz

But then it picks up a ton of other stuff that I am not interested in having in my database.

So, now I'm wondering whether it is possible to tell Nutch to follow links through one namespace, but only add to its index database those pages that are in a specific sub-namespace of the first one?

Thank you very much...

-- 
View this message in context: http://www.nabble.com/Problems-with-indexing-sub-section-of-a-site-tp17417650p17417650.html
Sent from the Nutch - User mailing list archive at Nabble.com.
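
To make the question concrete, here is a minimal sketch of the behaviour being asked for: one broad pattern decides which URLs may be followed, and a narrower one decides which pages get indexed. It is plain Python, not Nutch configuration and not a Nutch API; whether Nutch can be set up this way is exactly the open question. The `classify` helper and the sample URLs are hypothetical, and the dots in the host name are escaped here, unlike in the filter lines above.

```python
import re

# Broad pattern: every URL the crawler may fetch and follow links from.
FOLLOW = re.compile(r"^http://([a-z0-9]*\.)*geekzone\.co\.nz")

# Narrow pattern: the only URLs whose content should be added to the index.
INDEX = re.compile(r"^http://([a-z0-9]*\.)*geekzone\.co\.nz/foobar")


def classify(url):
    """Hypothetical helper: say what the crawler should do with a URL."""
    if INDEX.match(url):
        return "index"        # fetch, follow links, and index the content
    if FOLLOW.match(url):
        return "follow-only"  # fetch and follow links, but do not index
    return "skip"             # outside the larger namespace entirely


if __name__ == "__main__":
    # Sample URLs are made up for illustration.
    for url in (
        "http://www.geekzone.co.nz/foobar/1234",
        "http://www.geekzone.co.nz/forums.asp",
        "http://www.example.com/",
    ):
        print(url, "->", classify(url))
```

The narrow pattern is checked first because every URL it matches also matches the broad one.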