Hi, 

You ran one crawl cycle. Depending on the generator and fetcher settings you 
are not guaranteed to fetch 200,000 URLs with only topN specified. Check the 
logs: the generator will tell you if there are too many URLs for a host or 
domain. Also check the fetcher logs; they will tell you how many URLs were 
actually fetched and why fetching likely stopped when it did.
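
In a default local setup both end up in the Hadoop log, so something like the 
sketch below is usually enough to find the relevant messages. The paths assume 
a stock Nutch 1.7 layout and your "-dir result" crawl directory, and the 
property names are the ones from nutch-default.xml, so adjust as needed:

# Generator and fetcher messages (the local runtime logs to logs/hadoop.log)
grep -i 'generator' logs/hadoop.log | less
grep -i 'fetcher' logs/hadoop.log | less

# Per-host/domain limits the generator applies; check whether you override
# them in conf/nutch-site.xml
grep -A 3 'generate.max.count' conf/nutch-default.xml conf/nutch-site.xml
grep -A 3 'generate.count.mode' conf/nutch-default.xml conf/nutch-site.xml

# Per-segment generated/fetched/parsed counts
bin/nutch readseg -list -dir result/segments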

Cheers

-----Original message-----
From: Bin Wang<[email protected]>
Sent: Friday 27th December 2013 19:50
To: [email protected]
Subject: Nutch Crawl a Specific List Of URLs (150K)

Hi,

I have a very specific list of about 140K URLs.

I switched off `db.update.additions.allowed` so it will not update the 
crawldb... and I was assuming I could feed all the URLs to Nutch, and after 
one round of fetching it would finish and leave all the raw HTML files in the 
segment folder.
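
(For reference, I plan to read the raw content back out of the segments with 
something like the following; the segment name and output directory are just 
placeholders:)

bin/nutch readseg -dump result/segments/<segment> dumpdir \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext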

However, after I ran this command:

nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &

It ended up with only a small number of URLs:

TOTAL urls:     872
retry 0:        872
min score:      1.0
avg score:      1.0
max score:      1.0

I double-checked the log to make sure that every URL passes the filters and 
normalization. Here is the log:

2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of urls rejected by filters: 0
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of urls injected after normalization and filtering: 139058
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.

I don't know how 140K URLs ended up being 872 in the end...

/usr/bin

----------------------

AWS ubuntu instance

Nutch 1.7

java version "1.6.0_27"

OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.4)

OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

