Comments inline. Thanks, Preetam
On Sun, Feb 8, 2015 at 10:56 AM, Mattmann, Chris A (3980) <[email protected]> wrote:

> Hi Preetam,
>
> -----Original Message-----
> From: Preetam Pradeepkumar Shingavi <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, February 8, 2015 at 10:18 AM
> To: "[email protected]" <[email protected]>, Chris Mattmann <[email protected]>
> Subject: Fetch queue size, Multiple seed URLs and Maximum Depth
>
> >Hi,
> >
> >I have configured Nutch with a seed URL in local/url/seed.txt, with just one
> >URL to test (depth=2):
> >
> >https://www.aoncadis.org/home.htm
> >
> >DOUBTS:
> >1. Fetch queue size:
> >
> >Looking at the logs the first time Nutch crawls, it shows the line below with
> >fetchQueues.totalSize=29, which changes to something like 389 for the same URL
> >in the next run:
> >
> >2015-02-07 17:56:47,530 INFO fetcher.Fetcher - -activeThreads=50,
> >spinWaiting=50, fetchQueues.totalSize=29, fetchQueues.getQueueCount=1
> >
> >After the crawl is done, if I crawl the same seed URL again, the fetch queue
> >should ideally be empty by now (I assume, since the first run is done) and show
> >the same fetchQueues.totalSize=29 as above, but the next run shows a fetch
> >queue size of 398, and it is really time consuming to complete this queue.
>
> I'm not sure I understand your question. The fetch queue is never empty, since
> it's driven by the URL DB. So, if Nutch finishes its fetcher run (configured by
> numberOfRounds) with URLs that are still unfetched, they will be marked as such
> in the UrlDB, and on the next iteration it will pick up where it left off on
> those URLs.
>
> >How do I avoid this?
>
> Why would you want to?

*Preetam: I was just curious whether it is possible to handle this manually.* I was anticipating that once the DB has been fetched and the CrawlDB holds the crawled data for all URLs down to depth 2, the next run would not crawl the same URLs again. Is it that URLs discovered at depth 2 are kept unfetched in the queue (not dequeued, because the depth threshold has been reached), and are therefore fetched in the next run, which explains the increase in fetch queue size?

> >2. Do I give multiple seed URLs in seed.txt, each on its own line?
>
> Yep.
>
> >3. What maximum depth can I ask Nutch to crawl?
>
> numberOfRounds controls this, and you will have to experiment to determine
> the tradeoff here between depth and completeness.

*Preetam: Okay, cool.*

> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
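P.S. One way to see what is actually happening between runs is to look at the CrawlDB itself rather than the fetcher log. A minimal sketch, assuming the CrawlDB lives at crawl/crawldb (adjust the path to your local layout):

    # print CrawlDB statistics, including fetched vs. unfetched URL counts
    bin/nutch readdb crawl/crawldb -stats

If the unfetched count after round 1 is large, that would account for the bigger fetch queue in round 2: the links discovered while fetching the seed page are recorded in the CrawlDB as unfetched and get generated for fetching in the next round.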
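On the seed list and depth questions, here is the layout being assumed in this thread, sketched out: one URL per line in local/url/seed.txt (the example.org entry is just a placeholder), and numberOfRounds passed to the crawl script. The exact bin/crawl arguments differ between Nutch versions, so treat the last line only as a sketch and confirm against the usage string the script prints when run without arguments:

    # local/url/seed.txt -- one seed URL per line
    https://www.aoncadis.org/home.htm
    https://www.example.org/

    # run 2 rounds (roughly "depth 2" from the seeds); the argument order
    # varies by Nutch version, so check bin/crawl's usage output first
    bin/crawl local/url crawl/ 2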

