Problems with indexing sub-section of a site

foobar3001 Thu, 22 May 2008 19:46:50 -0700

Hello!

In short:

Is it possible to tell Nutch to follow the links through one larger
namespace, but only index (add to its database) the pages that are in a
sub-namespace of it?

The background:

I have started to experiment with crawling my blog with Nutch. The problem
is that this blog doesn't have its own domain. Instead, it is hosted on a
larger site, which also hosts discussion forums and other people's blogs.

My URL there is "http://www.geekzone.co.nz/foobar", so naturally I thought
that adding something in the crawl-urlfilter.txt file would help. Something
like this:

+^http://([a-z0-9]*\.)*geekzone.co.nz/foobar

But look at the bottom of that page: the navigation links to the other pages
in my blog - or to the 'next' page - actually lead out of my namespace. Thus,
they are not being picked up anymore and Nutch never sees the additional
links that I have on those other pages.
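
To make the problem concrete, here is a rough Python sketch (not Nutch
code) of how I understand the rule to behave, assuming it is applied as a
regex search against each URL. The navigation URL below is made up for
illustration; it only stands for "some link on the same host but outside
/foobar".

```python
import re

# The include rule from above, without the leading '+' marker.
BLOG_RULE = re.compile(r"^http://([a-z0-9]*\.)*geekzone.co.nz/foobar")

def accepted(url: str) -> bool:
    """Return True if the URL matches the include rule."""
    return BLOG_RULE.search(url) is not None

blog_home = "http://www.geekzone.co.nz/foobar"
# Hypothetical navigation link, invented for this example only.
nav_link = "http://www.geekzone.co.nz/blog.asp?page=2"

print(accepted(blog_home))  # True: inside my namespace
print(accepted(nav_link))   # False: same host, but outside /foobar
```

Because the second URL is rejected, the page behind it is never fetched,
and neither are any links it contains.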

Since eventually I would like this to be a bit more generic (I don't want
anything specific for my blog, that's just a test case), I thought that
maybe I have to open it up to the root URL, making the filter something like
this:

+^http://([a-z0-9]*\.)*geekzone.co.nz

But then it picks up a ton of other stuff that I am not interested in having
in my database.

So, now I'm wondering whether it is possible to tell Nutch to follow links
through one namespace, but only add those pages into its index database that
are in a specific sub-namespace of the first one?
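
What I have in mind is something like two separate scopes: one that decides
whether a link is followed, and a narrower one that decides whether the page
is indexed. The Python sketch below only illustrates that idea; it is not
Nutch configuration, and the names FOLLOW_SCOPE, INDEX_SCOPE and decide, as
well as the forum URL in the usage note, are hypothetical.

```python
import re

# Wide scope: anything on the hosting site may be followed.
FOLLOW_SCOPE = re.compile(r"^http://([a-z0-9]*\.)*geekzone.co.nz")
# Narrow scope: only pages under my blog should end up in the index.
INDEX_SCOPE = re.compile(r"^http://([a-z0-9]*\.)*geekzone.co.nz/foobar")

def decide(url: str) -> tuple[bool, bool]:
    """Return (follow, index) for a URL.

    A page is indexed only if it is also inside the follow scope.
    """
    follow = FOLLOW_SCOPE.search(url) is not None
    index = follow and INDEX_SCOPE.search(url) is not None
    return follow, index
```

With this, decide("http://www.geekzone.co.nz/foobar") would give
(True, True), while a made-up forum page such as
"http://www.geekzone.co.nz/forums.asp" would give (True, False): its links
are followed, but the page itself stays out of the index.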

Thank you very much...
--
View this message in context:
http://www.nabble.com/Problems-with-indexing-sub-section-of-a-site-tp17417650p17417650.html
Sent from the Nutch - User mailing list archive at Nabble.com.
