Re: Nutch Crawl a Specific List Of URLs (150K)

Tejas Patil Sun, 29 Dec 2013 22:47:55 -0800

Hi Bin Wang,

>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &
Were you creating a new crawldb or reusing an old one?


Were you running this on a cluster or in local mode?
Was there any failure that caused the fetch round to abort? (See the logs
for this.)
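
A minimal sketch of such a log check, assuming the log lines look like the
Injector excerpt quoted below. The sample lines are copied from that excerpt
(unwrapped); the temporary file stands in for the real Nutch log, whose
location depends on your setup, and the ERROR/WARN/Exception pattern is only
an illustrative guess at what a failure would look like.

```shell
#!/bin/sh
# Stand-in for the real Nutch log: sample lines taken from the excerpt
# quoted in this thread. Point LOG at your own log file in a real run.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of urls rejected by filters: 0
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of urls injected after normalization and filtering: 139058
2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
EOF

# Number of URLs the Injector reported as injected (last field of that line).
injected=$(grep 'urls injected after normalization and filtering' "$LOG" | awk '{print $NF}')

# Number of lines that might explain an aborted fetch round.
# (The pattern is an assumption; adjust it to what your log actually contains.)
problems=$(grep -c -E 'ERROR|WARN|Exception' "$LOG")

echo "injected=$injected problems=$problems"
rm -f "$LOG"
```

If the injected count is close to the size of the seed list but the crawldb
total is much smaller, the loss happened after injection, which narrows down
where to look.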

I would like to reproduce this issue. Would it be possible for you to share
your config files and a subset of the URLs?

Thanks,
Tejas


On Sat, Dec 28, 2013 at 2:10 AM, Talat Uyarer <[email protected]> wrote:

> Hi Bin,
>
> You have an interesting error. I don't use 1.7, but I can try with the screen
> command. I believe you will not get the same error.
>
> Talat
>
>
> 2013/12/27 Bin Wang <[email protected]>
>
>> Hi,
>>
>> I have a very specific list of URLs, which is about 140K URLs.
>>
>> I switched off `db.update.additions.allowed` so that it would not update the
>> crawldb... and I was assuming I could feed all the URLs to Nutch, and after
>> one round of fetching, it would finish and leave all the raw HTML files in
>> the segment folder.
>>
>> However, after I ran this command:
>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 200000 &
>>
>> It ended up with only a small number of URLs:
>> TOTAL urls: 872
>> retry 0: 872
>> min score: 1.0
>> avg score: 1.0
>> max score: 1.0
>>
>> I double-checked the log to make sure that every URL passes the
>> filter and normalization. Here is the log:
>>
>> 2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of urls rejected by filters: 0
>> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of urls injected after normalization and filtering: 139058
>> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
>>
>> I don't know how 140K URLs ended up being 872 in the end...
>>
>> /usr/bin
>>
>> ----------------------
>> AWS ubuntu instance
>> Nutch 1.7
>> java version "1.6.0_27"
>> OpenJDK Runtime Environment (IcedTea6 1.12.6)
>> (6b27-1.12.6-1ubuntu0.12.04.4)
>> OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
>>
>
>
>
> --
> Talat UYARER
> Websitesi: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>
