Hi Bin,

That is an interesting error. I don't use 1.7, but you could try running the
crawl under the screen command instead of nohup. I believe you will not get
the same error.
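
For example, something like this (just a sketch, reusing your exact crawl
command and assuming GNU screen is installed):

    # start a named screen session and run the crawl inside it
    screen -S nutch-crawl
    bin/nutch crawl urls -dir result -depth 1 -topN 200000
    # detach with Ctrl-a d; reattach later with: screen -r nutch-crawl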

Talat


2013/12/27 Bin Wang <[email protected]>

> Hi,
>
> I have a very specific list of URLs, about 140K of them.
>
> I switched off `db.update.additions.allowed` so it will not update the
> crawldb. I was assuming I could feed all the URLs to Nutch and that, after
> one round of fetching, it would finish and leave all the raw HTML files in
> the segment folder.
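>
> For reference, this is the entry I set in conf/nutch-site.xml (a sketch of
> the standard property; the comment is mine):
>
>     <property>
>       <!-- when false, updatedb only updates existing crawldb entries
>            and does not add newly discovered outlinks -->
>       <name>db.update.additions.allowed</name>
>       <value>false</value>
>     </property>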
>
> However, after I ran this command:
> nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &
>
> It ended up with only a small number of URLs:
> TOTAL urls: 872
> retry 0: 872
> min score: 1.0
> avg score: 1.0
> max score: 1.0
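>
> I read those numbers out of the crawldb stats, with something like the
> following (assuming the crawldb sits under the -dir path from above):
>
>     bin/nutch readdb result/crawldb -stats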
>
> I double-checked the log to make sure that every URL passes the filters
> and normalization. Here is the log:
>
> 2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of
> urls rejected by filters: 0
> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of
> urls injected after normalization and filtering: 139058
> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected
> urls into crawl db.
>
> I don't know how 140K injected URLs ended up as only 872...
>
> /usr/bin
>
> ----------------------
> AWS Ubuntu instance
> Nutch 1.7
> java version "1.6.0_27"
> OpenJDK Runtime Environment (IcedTea6 1.12.6)
> (6b27-1.12.6-1ubuntu0.12.04.4)
> OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
>



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
