I have a small question about the Nutch 2.X source code; I hope this is the
right mailing list for it. I was unable to locate the following in the code:

a) Where does the linkdb get generated? Which Java file contains the code for
   that?

b) I see the WebPage class being used to remember the pages that were
   gathered. It looks like the crawldb is a repository of these pages. If that
   is the case, then:

  -- It looks like WebPage remembers the contents of the page together with the
     rest of the information about the page. How do we delete content which is
     old and has not changed for a while?

  -- It does not appear that Nutch 2.X has any concept of segments. How do we
     delete stuff that is older than 1 month so that we don't blow out the disk
     space? It seemed that Nutch 1.x had segments, and older segments were
     removable.

Thanks
