Hi all. First off, I'm using Nutch 0.7.2. I've been playing with Nutch for a couple of weeks now and have some questions about indexing blog sites.
Many blog platforms publish a changes.xml file on a schedule (blogger.com/changes10.xml is updated every 10 minutes) listing the blogs that changed in the last 10 minutes; others expose an Atom stream. Either way, the URLs I need to index are handed to me: there are always new URLs to crawl, I already know which blogs have been updated, and I don't want Nutch to recrawl them automatically after some fixed interval (e.g. crawl again in 30 days).

Nutch seems to be designed to be given a few seed URLs, inject them into its DB, crawl them, extract new links from those pages, and crawl those too; previously crawled pages are recrawled automatically once the time since the last crawl passes a predefined interval (30 days by default). In other words, perfectly normal search-engine behavior.

For blogs, I want it to crawl the injected URLs and none of the links on the page. I did this (I think!) by setting db.max.outlinks.per.page to zero. I also want it to crawl ONLY the newly injected URLs, which I did by pointing urlfilter.prefix.file at the file containing the list of updated blog URLs. What I'm not sure about is whether this setup guarantees that, when 30 days roll around, Nutch won't start automatically throwing old URLs into newly generated segments for a recrawl.

For this test my cycle is: download changes10.xml, run it through xsltproc to get a plain-text list of URLs, inject that list into the DB (with urlfilter.prefix.file pointing at the same file), then generate a new segment, fetch, and index it. This produces a new segment every 10 minutes, and every 30 minutes I run 'merge' to merge the segment indexes into crawl/index.

My questions:

1. Does anyone see any problems with this setup?

2. I end up with a perpetually growing list of segments, which means each 'merge' run takes longer than the last. How do I fix this?

3. More generally, I've had to fiddle with Nutch's configuration enough to make it work this way that I have to ask whether Nutch is the right tool for this at all. I know Technorati uses Lucene directly for a similar purpose. Should that be the path I take instead (HTMLParser to fetch and extract text, plus a Lucene setup with incremental indexes)?

Thanks for any help anyone can provide.

Chris

--
Chris Newton, CTO Radian6, www.radian6.com
Phone: 506-452-9039
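P.S. For reference, here is roughly what one 10-minute run looks like as a shell script. I'm typing this from memory, so treat the exact arguments as approximate; the command names follow the Nutch 0.7 whole-web tutorial, and the directory layout (crawl/db, crawl/segments), the stylesheet name, and the filenames are just placeholders from my setup.

    #!/bin/sh
    # One 10-minute cycle: pull the changes feed, flatten it to a URL list,
    # inject those URLs, then generate/fetch/index a fresh segment.
    NUTCH=bin/nutch
    DB=crawl/db
    SEGMENTS=crawl/segments

    # Fetch the latest changes file and turn it into one URL per line.
    # urlfilter.prefix.file in nutch-site.xml points at this same file,
    # so only these URLs pass the URL filter.
    wget -q -O changes10.xml http://www.blogger.com/changes10.xml
    xsltproc changes-to-urls.xsl changes10.xml > updated-urls.txt

    # Inject the updated blog URLs into the web DB.
    $NUTCH inject $DB -urlfile updated-urls.txt

    # Generate a segment for the newly injected URLs, fetch it, index it.
    $NUTCH generate $DB $SEGMENTS
    SEGMENT=`ls -d $SEGMENTS/2* | tail -1`
    $NUTCH fetch $SEGMENT
    $NUTCH index $SEGMENT

    # Separately, every 30 minutes, 'bin/nutch merge' combines the
    # per-segment indexes into crawl/index.

The two property overrides mentioned above (db.max.outlinks.per.page = 0 and urlfilter.prefix.file) are set in my nutch-site.xml.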
