Hi Chris,
Hi all. First off, I'm using Nutch 0.72.
I've been playing with nutch for a couple weeks now, and have some
questions relating to indexing blog sites.
[snip]
Third... just in general... it seems I've had to goof with nutch's config
enough to make this work in this way, that it makes me want to ask if using
nutch for this purpose is indeed the correct path. I know Technorati just
directly uses lucene for a similar purpose. Should that be the path I take
(HTMLParser to fetch and extract text, Lucene setup with incremental
indexes)?
We've done something similar, in using Nutch to crawl code
repositories. My advice would be to continue down your current path,
as there's quite a lot in Nutch besides just the fetching support
that proves useful when processing and serving up web-based content.
Eventually you might decide to just use Lucene and various pieces of
Nutch as a better solution, but until then I think it's probably
faster to use Nutch as your starting point, and also if/when that
time comes, you'll have a much better understanding of how best to
slice and dice.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"