Hi,

I have configured Nutch with a seed file at local/url/seed.txt containing just one URL for testing (depth=2):

https://www.aoncadis.org/home.htm

I have three questions:

1. Fetch queue size: Watching the logs during the first crawl of this URL, Nutch reports *fetchQueues.totalSize=29*:

2015-02-07 17:56:47,530 INFO fetcher.Fetcher - -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=29, fetchQueues.getQueueCount=1

After the crawl is done, if I crawl the same seed URL again, I would expect to see the same fetchQueues.totalSize=29, since the fetch queue should be empty once the first run has finished. Instead, the second run reports a fetch queue size of 398, and working through that queue is really time consuming. How do I avoid this?

2. Can I give multiple seed URLs in seed.txt, one per line?

3. What is the maximum depth I can ask Nutch to crawl?

Thanks,
Preetam
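For context, this is roughly how my setup looks. The second seed URL below is a placeholder to illustrate question 2, and the `bin/crawl` arguments vary between Nutch versions, so treat the commented-out invocation as a sketch rather than a verified command line:

```shell
# Create the seed directory and seed list; Nutch expects one URL per line.
mkdir -p local/url
cat > local/url/seed.txt <<'EOF'
https://www.aoncadis.org/home.htm
https://example.org/
EOF

# Crawl invocation (Nutch 1.x crawl script; the exact argument order
# depends on the version -- in 1.9 it is roughly
#   bin/crawl <seedDir> <crawlID> <numberOfRounds>
# where the last argument plays the role of depth):
# bin/crawl local/url crawl1 2
```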

