Hi,
Is it possible to have Nutch crawl a set of documents at a time? I have set up Nutch with the options topN = 20 and depth = 2, so I wanted Nutch to crawl my directory no deeper than 2 links from the root directory. The root directory itself contains more than 20 files, but my understanding of topN is that the crawler fetches 20 documents and then indexes them; on the next crawl it chooses another 20 files from the directory, fetches them, and indexes them.

My problem is that when Nutch crawls, it keeps fetching the same files over and over again. That is a severe issue in my case, because I have to run Nutch on a directory with more than 100 GB of data, and it is far more efficient to crawl and index a small set of files at a time than to fetch all the data before indexing.

Can you suggest a workaround, or let me know what I am doing wrong?

Thanks in advance.

Regards,
Armel
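
P.S. For reference, this is roughly how I invoke the crawl (the `urls` and `crawl` paths are just from my local setup), using the one-shot crawl command:

```shell
# One-shot crawl: follow links up to depth 2, at most 20 URLs per fetch round
bin/nutch crawl urls -dir crawl -depth 2 -topN 20
```

Should I instead be driving the cycle step by step, so that the generator picks a fresh batch of unfetched URLs on each round? Something like the following (I am not sure I have the exact syntax right for my Nutch version):

```shell
# One generate/fetch/update round; repeating this should work through
# the backlog 20 URLs at a time, if I understand the generator correctly.
bin/nutch generate crawl/crawldb crawl/segments -topN 20
segment=`ls -d crawl/segments/* | tail -1`   # newest segment
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment    # mark the batch as fetched
```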
