Hi, I had established a way to crawl in Nutch 0.9, but it no longer works in Nutch 1.0. I hope someone can shed some light on this problem.
This is what I do. I have divided my set of URLs into a few groups and crawl each group separately, so that each can have its own depth, filters, and schedule. Some groups consist of URLs that are all from the same site. After all groups are done, I copy all the segments into one place, run a crawldb update over them (which creates a new crawldb), and then index. This scheme worked well with Nutch 0.9. But since I switched to Nutch 1.0, the search results miss the URLs of certain segments altogether. I have made sure that I am not filtering them out in any of the steps (crawldb update and index). Am I doing this totally wrong, and was it just luck that it worked in 0.9? Or did something change in 1.0? Thanks.
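For reference, the post-crawl steps described above look roughly like the sketch below. It only builds and prints the command lines; it does not run Nutch. The paths and segment names are hypothetical placeholders. The `invertlinks` step and the argument order are assumptions based on the `bin/nutch` usage messages as I understand them, so they may not match the exact setup.

```python
# Sketch of the merge-and-index workflow: build the bin/nutch command lines
# that would be run after all per-group crawls have finished.
# All paths and segment names below are hypothetical placeholders.

def build_commands(segments, crawldb="merged/crawldb", linkdb="merged/linkdb",
                   index="merged/indexes", nutch="bin/nutch"):
    """Return the commands (as argument lists) for the copied segments."""
    if not segments:
        raise ValueError("at least one segment is required")
    return [
        # Create or update a crawldb from every copied segment.
        [nutch, "updatedb", crawldb, *segments],
        # Assumption: a linkdb is built from the same segments before indexing.
        [nutch, "invertlinks", linkdb, *segments],
        # Index all segments against the new crawldb and linkdb.
        [nutch, "index", index, crawldb, linkdb, *segments],
    ]


if __name__ == "__main__":
    copied = [
        "merged/segments/20090401120000",  # hypothetical segment from group A
        "merged/segments/20090402120000",  # hypothetical segment from group B
    ]
    for cmd in build_commands(copied):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check that every copied segment really is passed to both the crawldb update and the index step.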