Hi Preetam,

-----Original Message-----
From: Preetam Pradeepkumar Shingavi <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 8, 2015 at 10:18 AM
To: "[email protected]" <[email protected]>, Chris Mattmann
<[email protected]>
Subject: Fetch queue size, Multiple seed URLs and Maximum Depth

>Hi,
>
>
>I have configured NUTCH with seed URL in local/url/seed.txt with just 1
>URL to test (depth=2) :
>
>
>https://www.aoncadis.org/home.htm
>
>
>
>DOUBTS :
>1. Fetch queue size :
>
>
>Watching at the LOGs first time, while NUTCH crawls, it shows (see below
>fetchQueues.totalSize=29
>which changes to something like 389 for same URL in the next run) :
>
>
>2015-02-07 17:56:47,530 INFO  fetcher.Fetcher - -activeThreads=50,
>spinWaiting=50,
>fetchQueues.totalSize=29, fetchQueues.getQueueCount=1
>
>
>
>After the crawling is done and if I want to crawl the same seed url
>again, ideally since the fetchqueue is now empty (I assume since first
>run is done) should show the same fetchqueue.totalsize=29 as above but in
>the next run it shows this fetch queue
>size as 398 and it's really time-consuming to complete this queue.

I’m not sure I understand your question. The fetch queue is never empty
for good, since it’s driven by the URL DB. If Nutch finishes its fetcher
run (the number of rounds is configured by numberOfRounds) and some URLs
it has discovered are still unfetched, they stay marked as such in the
URL DB, and on the next iteration it will pick up where it left off on
those URLs.
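
As a rough illustration of why the queue size changes between runs, here
is a toy Python model. It is not Nutch code: the names crawl_db, LINKS and
run_round, and the link graph itself, are made up for this sketch. It only
assumes that each round queues whatever is still unfetched and that newly
discovered outlinks are added as unfetched.

```python
# Toy model only: this is NOT Nutch code. The names (crawl_db, LINKS,
# run_round) and the link graph are hypothetical, for illustration.
UNFETCHED, FETCHED = "unfetched", "fetched"

# Hypothetical link graph: page -> outlinks found when the page is parsed.
LINKS = {
    "seed": ["a", "b", "c"],
    "a": ["d", "e"],
    "b": ["e", "f"],
    "c": [],
}

def run_round(crawl_db):
    """Run one generate/fetch/update cycle and return the queue size."""
    # The queue is rebuilt each round from whatever is still unfetched.
    queue = [url for url, status in crawl_db.items() if status == UNFETCHED]
    for url in queue:
        crawl_db[url] = FETCHED
        # Newly discovered outlinks enter the DB as unfetched.
        for outlink in LINKS.get(url, []):
            crawl_db.setdefault(outlink, UNFETCHED)
    return len(queue)

crawl_db = {"seed": UNFETCHED}
sizes = [run_round(crawl_db) for _ in range(3)]
print(sizes)  # [1, 3, 3]
```

In this sketch the queue starts at 1 (just the seed) and grows once the
seed's outlinks are known, which is the same kind of jump as 29 to 389.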

>
>
>How do I avoid this ?

Why would you want to?

>
>
>2. Do I give multiple seed URLs in seed.txt, each on one line ?

Yep.

>
>
>3. What Max depth I can ask NUTCH to crawl.

numberOfRounds controls this and you will have to experiment to determine
the tradeoff here between depth and completeness.
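
To make the depth/completeness tradeoff concrete, here is a small Python
sketch. It is a toy model under stated assumptions, not Nutch code: the
link graph, the crawl function and the per_round_cap parameter are all
hypothetical. It assumes each round fetches at most per_round_cap pending
URLs, oldest first.

```python
# Toy model only, not Nutch code. The graph, the function name and the
# per_round_cap parameter are hypothetical.
from collections import deque

LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": ["f"],
}

def crawl(seed, rounds, per_round_cap):
    """Return {url: link depth} for the pages fetched within `rounds`."""
    depth = {seed: 0}        # every URL discovered so far
    pending = deque([seed])  # discovered but not yet fetched
    fetched = {}
    for _ in range(rounds):
        take = min(per_round_cap, len(pending))
        batch = [pending.popleft() for _ in range(take)]
        for url in batch:
            fetched[url] = depth[url]
            for out in LINKS.get(url, []):
                if out not in depth:
                    depth[out] = depth[url] + 1
                    pending.append(out)
    return fetched

# Same number of rounds, different caps:
print(len(crawl("seed", rounds=3, per_round_cap=10)))  # 6
print(len(crawl("seed", rounds=3, per_round_cap=1)))   # 3
```

With a generous cap, three rounds reach link depth 2 and fetch six pages;
with a cap of one, the same three rounds fetch only three pages and never
get past depth 1. More rounds go deeper, but how complete each level is
depends on how much is fetched per round.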

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

