Nutch 2.3.1 re-crawls unchanged web pages

Vladimir Loubenski Thu, 24 Nov 2016 12:10:59 -0800

Hi,
I am using Nutch 2.3.1 and run the generate, fetch, parse and updatedb steps 
in a loop. 
I noticed that during a re-crawl, even if a web page has not changed, Nutch 
does not detect this from the ETag, Last-Modified or signature fields, and 
continues to process all of these steps for the unchanged pages.
Is this the expected behaviour?
Are there plans to fix it in a future release?

Regards,
Vladimir.
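
To make the question concrete, here is a minimal sketch of the kind of check 
being asked about. It is not Nutch's implementation: the function name 
page_unchanged and its argument order are hypothetical, and treating any one 
matching field as sufficient is an assumption of the sketch.

```shell
#!/bin/sh
# Illustrative only: this is NOT how Nutch decides anything internally.
# Usage: page_unchanged OLD_ETAG NEW_ETAG OLD_LASTMOD NEW_LASTMOD OLD_SIG NEW_SIG
# Arguments are read as (stored, fresh) pairs. Returns 0 ("unchanged") as soon
# as one pair has a non-empty stored value equal to the fresh one, else 1.
page_unchanged() {
  while [ "$#" -ge 2 ]; do
    if [ -n "$1" ] && [ "$1" = "$2" ]; then
      return 0
    fi
    shift 2
  done
  return 1
}
```

A caller would skip the remaining steps for a page when page_unchanged 
returns 0, and process it as usual otherwise. Empty stored values never count 
as a match, so a page seen for the first time is always processed.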

-----Original Message-----
From: Jim Lamb [mailto:[email protected]] 
Sent: November-22-16 6:22 AM
To: [email protected]
Subject: Re: Automating Nutch 2.3.1 on Amazon EMR

Further to this, I have found that I can only submit a maximum of 256 steps to 
EMR. Some of our crawls take over 100 rounds, so defining an arbitrary number 
of (generate,fetch,parse,updatedb,index,solrdedup) rounds each with 6 steps 
isn't going to work either :-(

Has nobody automated this?

Thanks,

Jim

Sent: Thursday, November 17, 2016 at 11:30 AM
From: "Jim Lamb" <[email protected]>
To: [email protected]
Subject: Re: Automating Nutch 2.3.1 on Amazon EMR

Hi Sebastian,

Thanks for coming back to me.

> Adding
> set -x
> to bin/nutch and then running bin/crawl with a sample crawl which 
> includes all steps should log all commands with a full list of arguments.

Yes, that's a great idea. Thanks.

> But on EMR it should be possible to directly reference the Nutch job 
> file by a s3:// URL. (but haven't tried it this way)

Yes, that is possible. You add an S3 URL to the Jar= argument in your step 
definition of the create-cluster command.

> aws emr terminate-cluster ...

Ah, yes. I did wonder whether the master instance's role has the privileges 
needed to do this. I'll try.

Unfortunately, it still doesn't solve the iteration issue. Short of defining 
many repeated sets of steps, I don't see how I would get multiple rounds. 
What am I missing?

Thanks,

Jim
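
One possible way around the per-round step definitions is to put the loop 
inside a single script and submit that script as one step. The sketch below 
only shows the looping logic. The names run_round and run_rounds are 
hypothetical, run_round is a placeholder that must be replaced with the real 
per-round commands (for example the ones logged by set -x), and the convention 
that a round returns non-zero when nothing is left to crawl is an assumption.

```shell
#!/bin/sh
# Placeholder for one generate/fetch/parse/updatedb/index/solrdedup round.
# Assumption: it returns non-zero when there is nothing left to crawl.
run_round() {
  echo "round $1: replace this with the real commands"
}

# Run at most $1 rounds, stopping early when a round reports no work.
run_rounds() {
  rr_max=$1
  rr_n=1
  while [ "$rr_n" -le "$rr_max" ]; do
    if ! run_round "$rr_n"; then
      echo "stopping: round $rr_n reported nothing to do"
      return 0
    fi
    rr_n=$((rr_n + 1))
  done
  echo "reached the limit of $rr_max rounds"
}
```

Calling run_rounds 100 at the end of the script would then cover a crawl of up 
to 100 rounds in one step, so the number of rounds is no longer tied to the 
number of steps submitted.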
