Hello,
I'm using the Nutch 0.9 jar to program crawls in Java with a
predefined depth, and I'm running into a problem when recrawling.
I'm not sure I'm solving it the right way:
The first crawl works fine, but when I recrawl, my crawl database
still contains the pages and links from the previous run, so if I
first crawl with depth 1 and later recrawl with depth 1, the result
is like a depth-2 crawl. An example:
I do a depth-1 crawl of www.fgfgfgfgfgfgf.com ; it fetches
information from that page, and that information contains a link to
www.vbvbvbvbvbvbvbvb.com. When I recrawl with depth 1, it crawls
both the first site and the second one, which was added to the
database during the first crawl. So it is as if I had done a depth-2
crawl of the first site, not a depth-1 recrawl.
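To make the effect concrete, here is a self-contained sketch in plain Java. The in-memory link graph and the sets standing in for the crawl database are hypothetical stand-ins, not the actual Nutch API; the loop only mimics what generate/fetch/updatedb do in sequence:

```java
import java.util.*;

public class DepthDemo {

    // Hypothetical link graph standing in for the live sites in the example:
    // the seed page links to one other site, which links nowhere.
    static final Map<String, List<String>> WEB = Map.of(
            "www.fgfgfgfgfgfgf.com", List.of("www.vbvbvbvbvbvbvbvb.com"),
            "www.vbvbvbvbvbvbvbvb.com", List.of());

    // One depth-1 crawl cycle: generate a segment from every URL in the db,
    // "fetch" it, and fold the discovered outlinks back into the db --
    // roughly what generate, fetch and updatedb do in sequence.
    static List<String> crawlOnce(Set<String> crawlDb) {
        List<String> segment = new ArrayList<>(crawlDb);
        for (String url : segment) {
            crawlDb.addAll(WEB.getOrDefault(url, List.of()));
        }
        return segment;
    }

    public static void main(String[] args) {
        Set<String> crawlDb = new TreeSet<>();
        crawlDb.add("www.fgfgfgfgfgfgf.com");
        System.out.println("crawl fetches:   " + crawlOnce(crawlDb));
        // The outlink survived in the db, so the depth-1 recrawl
        // fetches both sites, i.e. it behaves like a depth-2 crawl:
        System.out.println("recrawl fetches: " + crawlOnce(crawlDb));
    }
}
```

The first call fetches only the seed; the second call fetches both URLs, which is exactly the unwanted depth-2 behaviour described above.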
To solve this, when I recrawl I create a temporary crawl database
seeded only with my seed URL, run the crawl at the desired depth
against it, and then update my original database with the information
fetched during that temporary recrawl. To make it clear:
Recrawl:
  inject URL into <temp database>
  cycle:
    generate from <temp database>
    fetch the generated segment
    update both databases, <original database> and <temp database>,
      with the fetched information
  end cycle
  generate index from the <original database> information, delete
    duplicates and merge with the old index
  delete <temp database>
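The scheme above can be sketched the same way, again with a hypothetical in-memory link graph and sets playing the role of the two databases (this is only a model of the workflow, not Nutch code):

```java
import java.util.*;

public class TempDbDemo {

    // Same hypothetical link graph as in the example.
    static final Map<String, List<String>> WEB = Map.of(
            "www.fgfgfgfgfgfgf.com", List.of("www.vbvbvbvbvbvbvbvb.com"),
            "www.vbvbvbvbvbvbvbvb.com", List.of());

    // Recrawl at the given depth through a throwaway temp db,
    // updating the original db with everything fetched or discovered.
    static List<String> recrawl(Set<String> originalDb, String seed, int depth) {
        Set<String> tempDb = new TreeSet<>();
        tempDb.add(seed);                                 // inject URL into <temp database>
        Set<String> done = new TreeSet<>();
        List<String> fetched = new ArrayList<>();
        for (int d = 0; d < depth; d++) {                 // cycle:
            List<String> segment = new ArrayList<>(tempDb);
            segment.removeAll(done);                      //   generate from <temp database>
            for (String url : segment) {                  //   fetch the generated segment
                done.add(url);
                fetched.add(url);
                List<String> outlinks = WEB.getOrDefault(url, List.of());
                tempDb.addAll(outlinks);                  //   update <temp database>
                originalDb.addAll(outlinks);              //   update <original database>
                originalDb.add(url);
            }
        }
        return fetched;                                   // <temp database> is discarded here
    }

    public static void main(String[] args) {
        Set<String> originalDb = new TreeSet<>();
        String seed = "www.fgfgfgfgfgfgf.com";
        System.out.println("first recrawl fetched:  " + recrawl(originalDb, seed, 1));
        System.out.println("original db now holds:  " + originalDb);
        // A second depth-1 recrawl still reaches only the seed,
        // even though the original db already contains the outlink:
        System.out.println("second recrawl fetched: " + recrawl(originalDb, seed, 1));
    }
}
```

Every recrawl starts from a fresh temp db, so a depth-1 recrawl only ever fetches the seed, while the original db still accumulates all discovered links for indexing.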
I would like to know whether there is a better way to recrawl
(without creating a temp database, for example by removing the links
from my database so that only the seed URL is used in the next
recrawl; I couldn't find a way to do that in the Nutch 0.9 API), and
whether the way I solve the problem is correct or has some bug that I
will regret when the application is almost done.
Thank you for reading!