Hi Chris,

 Hi all.  First off, I'm using Nutch 0.72.

 I've been playing with nutch for a couple weeks now, and have some
questions relating to indexing blog sites.

[snip]

 Third...  just in general... it seems I've had to goof with nutch's config
enough to make this work in this way, that it makes me want to ask if using
nutch for this purpose is indeed the correct path.  I know Technorati just
directly uses lucene for a similar purpose.  Should that be the path I take
(HTMLParser to fetch and extract text, lucene setup with incremental
indexes)?

We've done something similar, using Nutch to crawl code repositories. My advice would be to continue down your current path: Nutch offers quite a lot beyond fetching support that proves useful when processing and serving up web-based content.

Eventually you might decide that using Lucene directly, along with selected pieces of Nutch, is a better solution. Until then, I think it's probably faster to use Nutch as your starting point. If and when that time comes, you'll also have a much better understanding of how best to slice and dice.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
