Hi Chris,

>  Hi all.  First off, I'm using Nutch 0.72.
>
>  I've been playing with nutch for a couple weeks now, and have some
>questions relating to indexing blog sites.

[snip]

>  Third...  just in general... it seems I've had to goof with nutch's config
>enough to make this work in this way, that it makes me want to ask if using
>nutch for this purpose is indeed the correct path.  I know Technorati just
>directly uses lucene for a similar purpose.  Should that be the path I take
>(HTMLParser to fetch and extract text, lucene setup with incremental
>indexes)?

We've done something similar, using Nutch to crawl code 
repositories. My advice would be to continue down your current path, 
as there's quite a lot in Nutch besides just the fetching support 
that proves useful when processing and serving up web-based content.

Eventually you might decide that using Lucene directly, along with 
selected pieces of Nutch, is the better solution. Until then, I think 
it's probably faster to use Nutch as your starting point. And if/when 
that time comes, you'll have a much better understanding of how best 
to slice and dice.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general