Hi Chris,

> Hi all. First off, I'm using Nutch 0.72.
>
> I've been playing with nutch for a couple weeks now, and have some
> questions relating to indexing blog sites.
[snip]

> Third... just in general... it seems I've had to goof with nutch's config
> enough to make this work in this way, that it makes me want to ask if using
> nutch for this purpose is indeed the correct path. I know Technorati just
> directly uses lucene for a similar purpose. Should that be the path I take
> (HTMLParser to fetch and extract text, lucene setup with incremental
> indexes)?

We've done something similar, using Nutch to crawl code repositories.

My advice would be to continue down your current path, as there's quite a
lot in Nutch besides the fetching support that proves useful when
processing and serving up web-based content.

Eventually you might decide that using Lucene directly, plus various pieces
of Nutch, is a better solution. Until then, I think it's probably faster to
use Nutch as your starting point. And if/when that time comes, you'll have
a much better understanding of how best to slice and dice.

-- Ken

-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
