Hi Preetam,
-----Original Message-----
From: Preetam Pradeepkumar Shingavi <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 8, 2015 at 10:18 AM
To: "[email protected]" <[email protected]>, Chris Mattmann <[email protected]>
Subject: Fetch queue size, Multiple seed URLs and Maximum Depth

>Hi,
>
>I have configured NUTCH with seed URL in local/url/seed.txt with just 1
>URL to test (depth=2) :
>
>https://www.aoncadis.org/home.htm
>
>DOUBTS :
>1. Fetch queue size :
>
>Watching at the LOGs first time, while NUTCH crawls, it shows (see below
>fetchQueues.totalSize=29
>which changes to something like 389 for same URL in the next run) :
>
>2015-02-07 17:56:47,530 INFO fetcher.Fetcher - -activeThreads=50,
>spinWaiting=50,
>fetchQueues.totalSize=29, fetchQueues.getQueueCount=1
>
>After the crawling is done and if I want to crawl the same seed url
>again, ideally since the fetchqueue is now empty (I assume since first
>run is done) should show the same fetchqueue.totalsize=29 as above but in
>the next run it shows this fetch queue
> size as 398 and its really time consuming to complete this queue.

I’m not sure I understand your question. The fetch queue is never empty,
since it’s driven by the URL DB. So, if there are URLs that are still
unfetched when Nutch finishes its fetcher run (configured by
numberOfRounds), they will be marked as such in the URL DB, and the next
iteration will pick up where it left off on those URLs.

>How do I avoid this ?

Why would you want to?

>2. Do I give multiple seed URLs in seed.txt, each on one line ?

Yep.

>3. What Max depth I can ask NUTCH to crawl.

numberOfRounds controls this, and you will have to experiment to
determine the tradeoff here between depth and completeness.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
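
P.S. In case it helps with question 1, here is a rough toy model of why a
second run over the same seed can start with a bigger queue than the first.
It is only a sketch, not Nutch code: the link graph, the run_round and crawl
helpers, and the two status strings are all made up for illustration, and
number_of_rounds merely stands in for numberOfRounds. The assumption is
simply that each round fetches whatever is currently unfetched in the URL DB
and records newly discovered outlinks as unfetched.

```python
# Hypothetical link graph (page -> outlinks), invented for this example.
LINKS = {
    "seed": ["a", "b", "c"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "c": [],
    "a1": [],
    "a2": [],
    "b1": ["b2"],
    "b2": [],
}


def run_round(url_db, links):
    """Run one toy generate/fetch/update round.

    url_db maps URL -> "unfetched" or "fetched" and is updated in place.
    Returns how many URLs were queued for fetching in this round.
    """
    # Snapshot the unfetched URLs first, so URLs discovered during this
    # round wait until the next round.
    fetch_list = [url for url, status in url_db.items() if status == "unfetched"]
    for url in fetch_list:
        url_db[url] = "fetched"
        for outlink in links.get(url, []):
            # Newly discovered URLs enter the DB as unfetched; known URLs
            # keep their current status.
            url_db.setdefault(outlink, "unfetched")
    return len(fetch_list)


def crawl(url_db, links, number_of_rounds):
    """Run several rounds and return the queue size seen in each one."""
    return [run_round(url_db, links) for _ in range(number_of_rounds)]


url_db = {"seed": "unfetched"}          # inject a single seed URL
first_run = crawl(url_db, LINKS, 2)     # queue sizes per round, first run
second_run = crawl(url_db, LINKS, 2)    # same DB, so it resumes, not restarts

print("first run :", first_run)
print("second run:", second_run)
```

In this toy model the second run does not start over from the seed; it
starts from whatever the first run discovered but did not get to, which is
why its first queue is larger. More rounds reach deeper links at the cost of
longer runs, which is the tradeoff mentioned under question 3.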

