Nutch Crawl a Specific List Of URLs (150K)
Thanks for all the responses; they are very inspiring, and diving into the
logs is a very beneficial way to learn Nutch.
The fact is that I use Python BeautifulSoup to parse the sitemap of my targeted
website, which comes up with those 150K URLs; however, it
…tell you how much it crawled and why it likely stopped when it did.
Cheers
-----Original message-----
From: Bin Wang
Sent: Friday 27th December 2013 19:50
To: dev@nutch.apache.org
Subject: Nutch Crawl a Specific List Of URLs (150K)
Hi,
I have a very specific list of URLs, which is about 140K URLs.
Hi Bin Wang,
>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 &
Were you creating a new crawldb or reusing an old one?
Were you running this on a cluster or in local mode?
Was there any failure due to which the fetch round got aborted? (See the logs
for this.)
I would like to rep…
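A sketch of how those questions could be checked from the shell, assuming a
local-mode Nutch 1.x run started from the Nutch home directory (in local mode
jobs log to logs/hadoop.log; the `result` directory name is taken from the
command quoted above):

```shell
# Count how many URLs the fetcher actually attempted (local-mode log).
grep -c "fetching" logs/hadoop.log

# Surface any errors/exceptions that may have aborted the fetch round.
grep -iE "error|exception" logs/hadoop.log | tail -20

# Summarize crawldb status counts (db_fetched, db_unfetched, ...):
# bin/nutch readdb result/crawldb -stats
```

The second grep is usually the quickest way to see why a round stopped early.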
Hi Bin,
You have an interesting error. I don't use 1.7, but I can try with the screen
command. I believe you will not get the same error.
Talat
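A minimal sketch of that suggestion, assuming GNU screen is installed and
reusing the crawl command from the original post; the session name
`nutch-crawl` is just an illustrative choice:

```shell
# Start the crawl in a detached screen session so it survives the
# terminal closing (unlike a plain `nohup ... &` in some environments).
screen -dmS nutch-crawl bin/nutch crawl urls -dir result -depth 1 -topN 20

# Later, reattach to watch progress (detach again with Ctrl-a d):
screen -r nutch-crawl
```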
2013/12/27 Bin Wang
Hi,
I have a very specific list of URLs, which is about 140K URLs.
I switched off `db.update.additions.allowed` so it will not update the
crawldb... and I was assuming I could feed all the URLs to Nutch, and after
one round of fetching, it would finish and leave all the raw HTML files in
the segments.
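For reference, that switch is a standard Nutch property (its default is
defined in conf/nutch-default.xml); a minimal override sketch for
conf/nutch-site.xml:

```xml
<!-- Keep updatedb from adding newly discovered outlinks, so the
     crawldb stays restricted to the injected seed URLs. -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
</property>
```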