Hi All,
        I have been using Nutch for a while now, with 300M+ URLs crawled in the
last few months, so we have a lot of segments containing a mix of newly crawled
and recrawled data. While it is assumed safe to delete segments older than the
refresh rate, it is not certain that all the URLs in the old segments have been
recrawled, given the sheer number of URLs in the database.
Some of the older segments also contain the top-level homepages of many of the
domains, so I'd like to be sure that these are refreshed in another, newer
segment.

Is anybody tackling this problem? 
If not, I have been thinking of building the following tool:

1) Read a segment or a collection of segments.
2) Compare each URL to its entry in the webdb:
        If the URL was marked with -adddays from a previous fetchlist
generation, ignore it.
        If the URL was not accessible at the last crawl, add it to the list.
        If the URL was last crawled with a 200 or 30x status longer than the
refresh rate days ago, add it to the list. (Not sure about 303 pages.)
3) Sort the list and run it against the URL filters.
4) Generate a fetchlist with these URLs. If there are no URLs in this list,
then the segment is ready for deletion.
This part can be just a report on the state of the segments, or it can generate
the segment without marking it in the webdb.
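To make the idea concrete, here is a rough sketch of the core check in Java.
SegmentUrlSource, WebDbLookup, PageInfo and UrlFilters are just hypothetical
placeholders for whatever segment/webdb readers this would actually sit on top
of (not real Nutch classes), and the status handling follows the assumptions
above: skip -adddays URLs, collect failed fetches and anything last fetched
longer than the refresh rate ago.

// Hypothetical sketch only: SegmentUrlSource, WebDbLookup, PageInfo and
// UrlFilters are placeholder interfaces, not actual Nutch classes.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class StaleSegmentCheck {

    /** Placeholder view of a webdb entry for one URL (assumed fields). */
    interface PageInfo {
        boolean markedWithAddDays();  // already pushed out by -adddays
        int lastStatus();             // status code of the last fetch attempt
        long lastFetchTime();         // time (ms) of the last fetch
    }

    /** Placeholder readers/filters; would wrap the real segment and webdb readers. */
    interface SegmentUrlSource { Iterable<String> urls(); }
    interface WebDbLookup { PageInfo lookup(String url); }
    interface UrlFilters { boolean accept(String url); }

    /**
     * Collects the URLs in a segment that still need a refetch.
     * An empty result means the segment should be safe to delete.
     */
    static List<String> staleUrls(SegmentUrlSource segment, WebDbLookup webdb,
                                  UrlFilters filters, long refreshRateMillis, long now) {
        List<String> stale = new ArrayList<String>();
        for (String url : segment.urls()) {
            PageInfo page = webdb.lookup(url);
            if (page == null || page.markedWithAddDays()) {
                continue;  // step 2: skip urls already covered by -adddays
            }
            int status = page.lastStatus();
            boolean notAccessible = status >= 400 || status == 0;
            boolean stalePage = (status == 200 || (status >= 300 && status < 400))
                    && (now - page.lastFetchTime()) > refreshRateMillis;
            if (notAccessible || stalePage) {
                stale.add(url);  // step 2: needs to appear in a newer segment
            }
        }
        Collections.sort(stale);  // step 3: sort...
        for (Iterator<String> it = stale.iterator(); it.hasNext(); ) {
            if (!filters.accept(it.next())) {
                it.remove();      // ...and run against the url filters
            }
        }
        return stale;             // step 4: this feeds fetchlist generation
    }
}

If the returned list is empty, the tool would just report the segment as
deletable; otherwise it could write the list out as a fetchlist (without
touching the webdb, as in option 4 above).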

I'd welcome any comments.
If others find it useful, I will be happy to post it once it's done.

Phoebe.
