Hello,
I'm using the Nutch 0.9 jar to program crawls in Java with a
predefined depth, and I'm running into a problem when recrawling.
I'm not sure I'm solving it the right way:
The first crawl works fine, but when I recrawl, my crawl database
still contains the pages and links from the previous run, so if I
first crawl with depth 1 and later recrawl with depth 1, the result
is like a depth-2 crawl. An example:
I do a depth-1 crawl of www.fgfgfgfgfgfgf.com ; it fetches
information from that page, and that information contains a link to
www.vbvbvbvbvbvbvbvb.com. When I recrawl with depth 1, it crawls
both the first site and the second one, which was added to the
database during the first crawl. So it is as if I had done a depth-2
crawl of the first site, not a depth-1 recrawl.
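To make the effect concrete, here is a self-contained sketch in plain Java. The in-memory link graph and the sets standing in for the crawl database are hypothetical stand-ins, not the actual Nutch API; the loop only mimics what generate/fetch/updatedb do in sequence:

```java
import java.util.*;

public class DepthDemo {

    // Hypothetical link graph standing in for the live sites in the example:
    // the seed page links to one other site, which links nowhere.
    static final Map<String, List<String>> WEB = Map.of(
            "www.fgfgfgfgfgfgf.com", List.of("www.vbvbvbvbvbvbvbvb.com"),
            "www.vbvbvbvbvbvbvbvb.com", List.of());

    // One depth-1 crawl cycle: generate a segment from every URL in the db,
    // "fetch" it, and fold the discovered outlinks back into the db --
    // roughly what generate, fetch and updatedb do in sequence.
    static List<String> crawlOnce(Set<String> crawlDb) {
        List<String> segment = new ArrayList<>(crawlDb);
        for (String url : segment) {
            crawlDb.addAll(WEB.getOrDefault(url, List.of()));
        }
        return segment;
    }

    public static void main(String[] args) {
        Set<String> crawlDb = new TreeSet<>();
        crawlDb.add("www.fgfgfgfgfgfgf.com");
        System.out.println("crawl fetches:   " + crawlOnce(crawlDb));
        // The outlink survived in the db, so the depth-1 recrawl
        // fetches both sites, i.e. it behaves like a depth-2 crawl:
        System.out.println("recrawl fetches: " + crawlOnce(crawlDb));
    }
}
```

The first call fetches only the seed; the second call fetches both URLs, which is exactly the unwanted depth-2 behaviour described above.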
To solve this, when I recrawl I create a temporary crawl database
seeded only with my seed URL, run the crawl at the desired depth
against it, and then update my original database with the information
fetched during that temporary recrawl. To make it clear:
Recrawl:
  inject URL into <temp database>
  cycle:
    generate from <temp database>
    fetch the generated segment
    update both databases, <original database> and <temp database>,
      with the fetched information
  end cycle
  generate index from the <original database> information, delete
    duplicates and merge with the old index
  delete <temp database>
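The scheme above can be sketched the same way, again with a hypothetical in-memory link graph and sets playing the role of the two databases (this is only a model of the workflow, not Nutch code):

```java
import java.util.*;

public class TempDbDemo {

    // Same hypothetical link graph as in the example.
    static final Map<String, List<String>> WEB = Map.of(
            "www.fgfgfgfgfgfgf.com", List.of("www.vbvbvbvbvbvbvbvb.com"),
            "www.vbvbvbvbvbvbvbvb.com", List.of());

    // Recrawl at the given depth through a throwaway temp db,
    // updating the original db with everything fetched or discovered.
    static List<String> recrawl(Set<String> originalDb, String seed, int depth) {
        Set<String> tempDb = new TreeSet<>();
        tempDb.add(seed);                                 // inject URL into <temp database>
        Set<String> done = new TreeSet<>();
        List<String> fetched = new ArrayList<>();
        for (int d = 0; d < depth; d++) {                 // cycle:
            List<String> segment = new ArrayList<>(tempDb);
            segment.removeAll(done);                      //   generate from <temp database>
            for (String url : segment) {                  //   fetch the generated segment
                done.add(url);
                fetched.add(url);
                List<String> outlinks = WEB.getOrDefault(url, List.of());
                tempDb.addAll(outlinks);                  //   update <temp database>
                originalDb.addAll(outlinks);              //   update <original database>
                originalDb.add(url);
            }
        }
        return fetched;                                   // <temp database> is discarded here
    }

    public static void main(String[] args) {
        Set<String> originalDb = new TreeSet<>();
        String seed = "www.fgfgfgfgfgfgf.com";
        System.out.println("first recrawl fetched:  " + recrawl(originalDb, seed, 1));
        System.out.println("original db now holds:  " + originalDb);
        // A second depth-1 recrawl still reaches only the seed,
        // even though the original db already contains the outlink:
        System.out.println("second recrawl fetched: " + recrawl(originalDb, seed, 1));
    }
}
```

Every recrawl starts from a fresh temp db, so a depth-1 recrawl only ever fetches the seed, while the original db still accumulates all discovered links for indexing.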
I would like to know whether there is a better way to recrawl
(without creating a temp database, for example by removing the links
from my database so that only the seed URL is used in the next
recrawl; I couldn't find a way to do that in the Nutch 0.9 API), and
whether the way I solve the problem is correct or has some bug that I
will regret when the application is almost done.
Thank you for reading!