On Tue, Nov 20, 2012 at 12:12 AM, Sebastian Nagel <
[email protected]> wrote:
> Hi Cesare,
>
Ciao Sebastian, and thanks for your email.
> > modifiedTime = fetchTime;
> > instead of:
> > if (modifiedTime <= 0) modifiedTime = fetchTime;
> This will always overwrite modified time with the time the fetch took
> place.
> I would prefer the way as it's done in AdaptiveFetchSchedule:
> only set modifiedTime if it's unset (=0).
>
Here's my problem:
- you fetch a page XXX the first time
- modifiedTime is 0, so it's set to fetchTime
- from now on I'll get 304...
- ... unless XXX changes
- once XXX changes, modifiedTime is never updated again, so I never get
  another 304: every fetch returns 200, because the If-Modified-Since
  header keeps carrying the old timestamp and the server always considers
  the page modified
This is why I always set modifiedTime. We could skip the update when the
status is NOTMODIFIED; a tiny sketch of that rule is below.
The same issue seems to affect AdaptiveFetchSchedule.
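Just to make the rule explicit, here is a tiny self-contained sketch
(plain Java, not the actual DefaultFetchSchedule/AdaptiveFetchSchedule
code; the STATUS_NOTMODIFIED constant and the method name are only
stand-ins for this example): move modifiedTime forward on every fetch
except when the server answered "not modified", so If-Modified-Since
keeps tracking the last real change.

  public class ModifiedTimeRule {

    static final int STATUS_NOTMODIFIED = 304; // stand-in constant, just for this example

    // Always move modifiedTime forward, except when the server said "not modified".
    static long updateModifiedTime(long modifiedTime, long fetchTime, int status) {
      if (status == STATUS_NOTMODIFIED) {
        return modifiedTime;   // page unchanged: keep the old timestamp
      }
      return fetchTime;        // first fetch or a 200: record when we saw the change
    }

    public static void main(String[] args) {
      long t = updateModifiedTime(0L, 1000L, 200);            // first fetch -> 1000
      t = updateModifiedTime(t, 2000L, STATUS_NOTMODIFIED);   // 304 -> stays 1000
      t = updateModifiedTime(t, 3000L, 200);                  // page changed -> 3000
      System.out.println(t);                                  // prints 3000
    }
  }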
> > I don't know if this is correct (probably not) but at least 304 seems
> > to be handled. In particular, in the protocol-file
> > (File.getProtocolOutput) I've added a special case for 304:
> >
> > if (code == 304) { // got a not modified response
> >   return new ProtocolOutput(response.toContent(),
> >       ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
> > }
> >
> > I suppose this is NOT the right solution :-)
> At first glance, it's not bad. Protocol-file obviously needs a revision:
> the 304 is set properly in FileResponse.java but in File.java it is
> treated as a redirect:
> else if (code >= 300 && code < 400) { // handle redirect
> So, thanks. Good catch!
>
> Would be great if you could open Jira issues for
> - setting modified time in DefaultSchedule
> - 304 handling in protocol-file
> If you can provide patches, even better. Thanks!
>
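For protocol-file, the fix seems simple: test for 304 before the generic
3xx range, so a not-modified response never falls into the redirect
handling. A toy example of the ordering I mean (not the actual
File.getProtocolOutput code, just the decision order):

  public class StatusDispatch {

    // Check 304 before the generic 3xx range, so a "not modified"
    // response is never mistaken for a redirect.
    static String classify(int code) {
      if (code == 304) {                       // not modified
        return "NOTMODIFIED";
      } else if (code >= 300 && code < 400) {  // real redirects only
        return "REDIRECT";
      } else if (code == 200) {
        return "OK";
      }
      return "OTHER";
    }

    public static void main(String[] args) {
      System.out.println(classify(304)); // NOTMODIFIED, not REDIRECT
      System.out.println(classify(302)); // REDIRECT
      System.out.println(classify(200)); // OK
    }
  }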
As for the schedule, I want to be sure about the right way to set
modifiedTime.
> About your problem with removal / re-adding files:
> - a file system is crawled as if it were linked web pages:
>   a directory is just an HTML page with all files and sub-directories
>   as links.
>
This is clear. Let's consider a page A that links to a page B:
A -> B
A is the seed. I use the following command:
./nutch crawl urls -depth 2 -topN 5
We crawl it. Ok.
Now let's remove page B.
./nutch crawl urls -depth 2 -topN 5
B gets a 404. Fine.
Now let's restore B and crawl again.
This works as expected if A and B are HTML pages (B is fetched again by
"./nutch crawl"). If A is a directory and B is a file, B will never be
fetched again. Moreover, in this case A gets a 200 because a new file was
added, so the parsing/generate phases should force the refetch of B,
shouldn't they?
Reproducing it is easy:
mkdir /tmp/files/
echo "AAA" >/tmp/files/aa.txt
The only seed is file://localhost/tmp/files/
./nutch crawl urls -depth 2 -topN 5 // both /tmp/files/ and
/tmp/files/aa.txt are fetched
rm /tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5 // /tmp/files/aa.txt gets a 404
echo "AAA" >/tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5 // /tmp/files/ has changed, so it is
fetched (200), while for aa.txt the log shows:
ParserJob: parsing all
Parsing file://localhost/tmp/files/
Skipping file://localhost/tmp/files/aa.txt; different batch id (null)
and aa.txt is never fetched again, even though the page that links to it
(the directory) has changed.
Is this the expected behavior?
Thanks a lot.
--
Cesare