This has all probably been hashed out ad nauseam, but I haven't seen an end-to-end howto on what I am trying to do. If I can get all the kinks worked out (and understand all the pieces), I'll be glad to write one.

I have a domain that has several hundred thousand documents. I would like to:
  * Set up an initial index and db using the crawl tool (to some reasonable depth) to get me started
  * Hook up the NutchBean to actually do the searches
  * Continually crawl the 'next 1000 (or so) links' daily to go 'deeper' into the site, refreshing the index after each of these incremental crawls
  * Keep the pages fresh (no more than 15 days old)
  * Remove pages when they disappear from the server
  * Use a finite amount of resources

Here is what I have so far:
  * nutch crawl myurls -dir myindex -depth 5
       This creates 5 segments with:
          Number of pages: 32509
          Number of links: 545061
I assume this means that I have fetched and indexed 32509 pages and found 545061 links in the process (does this mean that I have 512552 pages to go?)
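
(As a sanity check on those numbers, stats can be dumped straight out of the web db -- just a sketch, assuming I'm reading the 0.7 readdb options right and that the crawl tool put the db under myindex/db:)

      # print page and link totals straight from the web db
      nutch readdb myindex/db -stats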

  * Setup the NutchBean to serve searches
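
(Concretely, one way to serve searches with the stock webapp -- just a sketch, assuming Tomcat at $TOMCAT_HOME, the 0.7.1 war in the current directory, and /path/to/myindex standing in for the crawl dir; as I understand it the NutchBean looks in searcher.dir, which defaults to the current directory, for the db/segments/index:)

      # deploy the bundled search webapp
      cp nutch-0.7.1.war $TOMCAT_HOME/webapps/ROOT.war
      # start Tomcat from the crawl directory so searcher.dir ('.') resolves to it
      cd /path/to/myindex
      $TOMCAT_HOME/bin/startup.sh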

  * Change db.default.fetch.interval=15
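
(A sketch of that override living in conf/nutch-site.xml instead of an edit to nutch-default.xml; the value is in days, the snippet clobbers any existing nutch-site.xml, and the <nutch-conf> root element is copied from the 0.7.1 nutch-default.xml -- double-check yours matches:)

# overwrite conf/nutch-site.xml with just the fetch-interval override
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>db.default.fetch.interval</name>
    <value>15</value>
  </property>
</nutch-conf>
EOF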

* Daily, create a new segment, index it, dedup, and merge it into the main index
      # Grab some pages and update the database
      nutch generate index/db index/segments -topN 1000
      s1=`ls -d index/segments/2* | tail -1`
      nutch fetch $s1
      nutch updatedb index/db $s1

      # update the segments with scores and anchors from the web db
      nutch updatesegs index/db index/segments index/workdir

      # index the newly fetched segment
      nutch index $s1 -dir index/workdir

      # delete duplicate content
      nutch dedup index/workdir index/segments

      # merge the per-segment indexes into the master index
      nutch merge -workingdir index/workdir index/index index/segments/2*

      rm -rf ${dir}/index/workdir
  * Tell the search server that the index changed ('reload' the NutchBean)
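
(I haven't found anything smarter for the 'reload' than bouncing the servlet container after the merge -- a sketch of that tail end of the nightly job; $TOMCAT_HOME and the cron paths are placeholders from my own setup:)

      # crude "reload": restart Tomcat so the NutchBean reopens the merged index
      $TOMCAT_HOME/bin/shutdown.sh
      sleep 15
      $TOMCAT_HOME/bin/startup.sh

      # the whole daily sequence runs from cron, e.g.
      # 0 3 * * * /home/nutch/bin/daily-update.sh >> /home/nutch/logs/daily-update.log 2>&1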


This all seems to work well and I happily do this for 15 days.
<time passes>

Now, as I understand it, nutch will see that a page is more than 15 days old, refetch it, and put it in one of my new segments. The old segment is ignored and the page in the new segment will be used.
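
(One way to check what nutch thinks is due for refetching -- assuming -dumppageurl prints the per-page 'next fetch' dates the way I think it does -- is to dump the page entries from the web db and eyeball them:)

      # dump page entries (URL, score, next fetch date, ...) from the web db
      nutch readdb index/db -dumppageurl | head -40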

Finally, my questions:
* I now have over 600k links and 40k pages in my database. How can I get nutch to fetch existing content (to make sure it's fresh) instead of fetching new content? Is there a deterministic approach nutch takes (or a way to influence it)?
* Is there any way to know when I can safely delete a segment? That is, how can I make sure all the pages in an old segment have been refetched in a subsequent one?
* I see some mention of inverting links in the Internet crawl. This isn't done in the 0.7.1 crawl tool (which I used to develop my incremental updates). Why would I want/need to do this in my situation (a single-site crawl)?
* Is there anything fundamentally wrong (or even screwy) with a setup like this? Are my assumptions correct? I realize that with these numbers I will never 'catch up' with the initial crawl if all I am doing is refreshing content (I guess I can do another 'big' segment each week, or something).



Sorry for the long post, and thanks in advance!
Steven




