This has all probably been hashed out ad nauseam, but I haven't seen an end-to-end howto on what I am trying to do. If I can get all the kinks worked out (and understand all the pieces), I'll be glad to write one.

I have a domain that has several hundred thousand documents. I would like to:
  * Set up an initial index and db using the crawl tool (to some reasonable depth) to get me started
  * Hook up the NutchBean to actually do the searches
  * Continually crawl the 'next 1000 (or so) links' daily to go 'deeper' into the site, refreshing the index after each of these incremental crawls
  * Keep the pages fresh (no more than 15 days old)
  * Remove pages when they disappear from the server
  * Use a finite amount of resources

Here is what I have so far:
  * nutch crawl myurls -dir myindex -depth 5
       This creates 5 segments with:
          Number of pages: 32509
          Number of links: 545061
I assume this means that I have fetched and indexed 32509 pages and found 545061 links in the process (does this mean that I have 512552 pages to go?)
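
(As a sanity check on those numbers, stats can be dumped straight out of the web db -- just a sketch, assuming I'm reading the 0.7 readdb options right and that the crawl tool put the db under myindex/db:)

      # print page and link totals straight from the web db
      nutch readdb myindex/db -stats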

  * Setup the NutchBean to serve searches
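
(Concretely, one way to serve searches with the stock webapp -- just a sketch, assuming Tomcat at $TOMCAT_HOME, the 0.7.1 war in the current directory, and /path/to/myindex standing in for the crawl dir; as I understand it the NutchBean looks in searcher.dir, which defaults to the current directory, for the db/segments/index:)

      # deploy the bundled search webapp
      cp nutch-0.7.1.war $TOMCAT_HOME/webapps/ROOT.war
      # start Tomcat from the crawl directory so searcher.dir ('.') resolves to it
      cd /path/to/myindex
      $TOMCAT_HOME/bin/startup.sh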

  * Change db.default.fetch.interval=15
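
(A sketch of that override living in conf/nutch-site.xml instead of an edit to nutch-default.xml; the value is in days, the snippet clobbers any existing nutch-site.xml, and the <nutch-conf> root element is copied from the 0.7.1 nutch-default.xml -- double-check yours matches:)

# overwrite conf/nutch-site.xml with just the fetch-interval override
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>db.default.fetch.interval</name>
    <value>15</value>
  </property>
</nutch-conf>
EOF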

* Daily, create a new segment, index it, dedup, and merge it into the main index
      # Grab some pages and update the database
      nutch generate index/db index/segments -topN 1000
      s1=`ls -d index/segments/2* | tail -1`
      nutch fetch $s1
      nutch updatedb index/db $s1

      # update the segments with scores and anchors from the web db
      nutch updatesegs index/db index/segments index/workdir

      # index the newly fetched segment
      nutch index $s1 -dir index/workdir

      # delete duplicate content
      nutch dedup index/workdir index/segments

      # merge the per-segment indexes into the master index
      nutch merge -workingdir index/workdir index/index index/segments/2*

      rm -rf ${dir}/index/workdir
  * Tell the search server that the index changed ('reload' the NutchBean)
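
(I haven't found anything smarter for the 'reload' than bouncing the servlet container after the merge -- a sketch of that tail end of the nightly job; $TOMCAT_HOME and the cron paths are placeholders from my own setup:)

      # crude "reload": restart Tomcat so the NutchBean reopens the merged index
      $TOMCAT_HOME/bin/shutdown.sh
      sleep 15
      $TOMCAT_HOME/bin/startup.sh

      # the whole daily sequence runs from cron, e.g.
      # 0 3 * * * /home/nutch/bin/daily-update.sh >> /home/nutch/logs/daily-update.log 2>&1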


This all seems to work well and I happily do this for 15 days.
<time passes>

Now, as I understand it, nutch will see that a page is more than 15 days old, refetch it, and put it in one of my new segments. The old segment is ignored and the page in the new segment will be used.
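
(One way to check what nutch thinks is due for refetching -- assuming -dumppageurl prints the per-page 'next fetch' dates the way I think it does -- is to dump the page entries from the web db and eyeball them:)

      # dump page entries (URL, score, next fetch date, ...) from the web db
      nutch readdb index/db -dumppageurl | head -40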

Finally, my questions:
* I now have over 600k links and 40k pages in my database. How can I get nutch to fetch existing content (to make sure it's fresh) instead of fetching new content? Is there a deterministic approach nutch takes (or a way to influence it)?
* Is there any way to know when I can safely delete a segment? That is, how can I make sure all the pages in an old segment have been refetched in a subsequent one?
* I see some mention of inverting links in the Internet crawl. This isn't done in the 0.7.1 crawl tool (which I used to develop my incremental updates). Why would I want/need to do this in my situation (a single-site crawl)?
* Is there anything fundamentally wrong (or even screwy) with a setup like this? Are my assumptions correct? I realize that with these numbers I will never 'catch up' with the initial crawl if all I am doing is refreshing content (I guess I can do another 'big' segment each week, or something).



Sorry for the long post, and thanks in advance!
Steven




