Hi,

I have a specific list of about 140K URLs.

I switched off `db.update.additions.allowed` so the updatedb step will not
add newly discovered URLs to the crawldb... and I assumed I could feed all
the URLs to Nutch, and after one round of fetching it would finish and
leave all the raw HTML files in the segment folder.
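
For reference, this is roughly what I have in conf/nutch-site.xml (a
minimal sketch; everything else is the stock configuration):

<configuration>
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
    <!-- keep updatedb from adding newly discovered outlinks to the crawldb -->
  </property>
</configuration>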

However, after running this command:
nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &
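
(My understanding is that this crawl command, at depth 1, is roughly
equivalent to the individual steps below. This is just a sketch with the
same urls/result paths; the segment-name glob is an assumption:)

bin/nutch inject result/crawldb urls
bin/nutch generate result/crawldb result/segments -topN 200000
seg=`ls -d result/segments/2* | tail -1`
bin/nutch fetch $seg
bin/nutch readdb result/crawldb -stats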

It ended up with a small number of URLs:
TOTAL urls: 872
retry 0: 872
min score: 1.0
avg score: 1.0
max score: 1.0
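
(Those numbers are from readdb -stats. To see exactly which URLs made it
into the crawldb, I believe it can be dumped as text like this; the output
directory name is arbitrary:)

bin/nutch readdb result/crawldb -dump result/crawldb-dump
less result/crawldb-dump/part-00000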

I double-checked the log to make sure that every URL passes filtering and
normalization. Here is the log:

2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of
urls rejected by filters: 0
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of
urls injected after normalization and filtering: 139058
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected
urls into crawl db.

I don't understand how 140K injected URLs ended up as only 872 in the end...
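
For what it's worth, I can also list what the single round actually
generated with the segment reader (a sketch; the -dir form should pick up
whatever segment names were created):

bin/nutch readseg -list -dir result/segments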


----------------------
AWS ubuntu instance
Nutch 1.7
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.4)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
