Hi,

I have configured Nutch with a seed file at local/url/seed.txt containing just one URL, to test with depth=2:

https://www.aoncadis.org/home.htm
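
For reference, this is roughly how the seed file is set up and the crawl is started (the exact bin/crawl arguments differ between Nutch versions, so treat this as a sketch rather than the exact commands):

```shell
# Seed directory containing one URL per line
mkdir -p local/url
echo "https://www.aoncadis.org/home.htm" > local/url/seed.txt

# Run the crawl for 2 rounds (the "depth").
# Nutch 1.x style invocation: bin/crawl <seedDir> <crawlDir> <numRounds>;
# some versions also take a Solr URL argument.
bin/crawl local/url crawl/ 2
```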

QUESTIONS:

1. Fetch queue size:

Watching the logs during the first run, while Nutch crawls, it shows *fetchQueues.totalSize=29* (see below), which grows to around 398 for the same URL in the next run:

2015-02-07 17:56:47,530 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=50, *fetchQueues.totalSize=29*, fetchQueues.getQueueCount=1

After the first crawl completes, I want to crawl the same seed URL again. Since the fetch queue should be empty once the first run is done, I would expect the next run to show the same fetchQueues.totalSize=29 as above, but instead it shows a fetch queue size of 398, and it takes a long time to work through that queue.

How do I avoid this?

2. To give multiple seed URLs in seed.txt, do I put each one on its own line?

3. What is the maximum depth I can ask Nutch to crawl?

Thanks,
Preetam
