On Tue, Nov 20, 2012 at 12:12 AM, Sebastian Nagel <
[email protected]> wrote:
> Hi Cesare,
>
Ciao Sebastian, and thanks for your email.
> > modifiedTime = fetchTime;
> > instead of:
> > if (modifiedTime <= 0) modifiedTime = fetchTime;
> This will always overwrite modified time with the time the fetch took
> place.
> I would prefer the way as it's done in AdaptiveFetchSchedule:
> only set modifiedTime if it's unset (=0).
>
Here's my problem:
- you fetch a page XXX the first time
- modifiedTime is 0, so it's set to fetchTime
- from now on I'll get 304...
- ... unless XXX changes
- once XXX changes, modifiedTime is never updated again, so I never get
  another 304: every fetch returns 200, because the If-Modified-Since
  header keeps carrying the old timestamp and the server always considers
  the page modified
This is why I always set modifiedTime. We could skip the update when the
status is NOTMODIFIED; a tiny sketch of that rule is below.
The same issue seems to affect AdaptiveFetchSchedule.
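Just to make the rule explicit, here is a tiny self-contained sketch
(plain Java, not the actual DefaultFetchSchedule/AdaptiveFetchSchedule
code; the STATUS_NOTMODIFIED constant and the method name are only
stand-ins for this example): move modifiedTime forward on every fetch
except when the server answered "not modified", so If-Modified-Since
keeps tracking the last real change.

  public class ModifiedTimeRule {

    static final int STATUS_NOTMODIFIED = 304; // stand-in constant, just for this example

    // Always move modifiedTime forward, except when the server said "not modified".
    static long updateModifiedTime(long modifiedTime, long fetchTime, int status) {
      if (status == STATUS_NOTMODIFIED) {
        return modifiedTime;   // page unchanged: keep the old timestamp
      }
      return fetchTime;        // first fetch or a 200: record when we saw the change
    }

    public static void main(String[] args) {
      long t = updateModifiedTime(0L, 1000L, 200);            // first fetch -> 1000
      t = updateModifiedTime(t, 2000L, STATUS_NOTMODIFIED);   // 304 -> stays 1000
      t = updateModifiedTime(t, 3000L, 200);                  // page changed -> 3000
      System.out.println(t);                                  // prints 3000
    }
  }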
> > I don't know if this is correct (probably not) but at least 304 seems
> > to be handled. In particular, in the protocol-file
> > (File.getProtocolOutput) I've added a special case for 304:
> >
> > if (code == 304) { // got a not modified response
> >   return new ProtocolOutput(response.toContent(),
> >       ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
> > }
> >
> > I suppose this is NOT the right solution :-)
> At first glance, it's not bad. Protocol-file obviously needs a revision:
> the 304 is set properly in FileResponse.java but in File.java it is
> treated as a redirect:
> else if (code >= 300 && code < 400) { // handle redirect
> So, thanks. Good catch!
>
> Would be great if you could open Jira issues for
> - setting modified time in DefaultSchedule
> - 304 handling in protocol-file
> If you can provide patches, even better. Thanks!
>
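For protocol-file, the fix seems simple: test for 304 before the generic
3xx range, so a not-modified response never falls into the redirect
handling. A toy example of the ordering I mean (not the actual
File.getProtocolOutput code, just the decision order):

  public class StatusDispatch {

    // Check 304 before the generic 3xx range, so a "not modified"
    // response is never mistaken for a redirect.
    static String classify(int code) {
      if (code == 304) {                       // not modified
        return "NOTMODIFIED";
      } else if (code >= 300 && code < 400) {  // real redirects only
        return "REDIRECT";
      } else if (code == 200) {
        return "OK";
      }
      return "OTHER";
    }

    public static void main(String[] args) {
      System.out.println(classify(304)); // NOTMODIFIED, not REDIRECT
      System.out.println(classify(302)); // REDIRECT
      System.out.println(classify(200)); // OK
    }
  }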
As for the schedule, I want to be sure about the right way to set
modifiedTime.
> About your problem with removal / re-adding files:
> - a file system is crawled as if it were linked web pages:
>   a directory is just an HTML page with all files and sub-directories
>   as links.
>
This is clear. Let's consider a page A that links to a page B:
A -> B
A is the seed. I use the following command:
./nutch crawl urls -depth 2 -topN 5
We crawl it. Ok.
Now let's remove page B.
./nutch crawl urls -depth 2 -topN 5
B gets a 404. Fine.
Now let's restore B and crawl again.
This works as expected if A and B are HTML pages (B is fetched again by
"./nutch crawl"). If A is a directory and B is a file, B will never be
fetched again. Moreover, in this case A gets a 200 because a new file was
added, so the parsing/generate phases should force the refetch of B,
shouldn't they?
Reproducing it is easy:
mkdir /tmp/files/
echo "AAA" >/tmp/files/aa.txt
The only seed is file://localhost/tmp/files/
./nutch crawl urls -depth 2 -topN 5 // both /tmp/files/ and
/tmp/files/aa.txt are fetched
rm /tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5 // /tmp/files/aa.txt gets a 404
echo "AAA" >/tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5 // /tmp/files/ has changed, so it is
fetched (200), while for aa.txt the log shows:
ParserJob: parsing all
Parsing file://localhost/tmp/files/
Skipping file://localhost/tmp/files/aa.txt; different batch id (null)
and aa.txt is never fetched again, even though the page that links to it
(the directory) has changed.
Is this the expected behavior?
Thanks a lot.
--
Cesare