recrawls file protocol causes Errors/Exceptions when actually not modified or
gone
----------------------------------------------------------------------------------
Key: NUTCH-911
URL: https://issues.apache.org/jira/browse/NUTCH-911
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.1
Reporter: Peter Lundberg
Priority: Minor
When recrawling a file system, files are marked as errors and log output such
as the following appears:
java.net.MalformedURLException
    at java.net.URL.<init>(URL.java:601)
    at java.net.URL.<init>(URL.java:464)
    at java.net.URL.<init>(URL.java:413)
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:85)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:627)
fetch of file:/Users/peter.lundberg/Documents/valtech/scan-test/Peter Lundberg 20090929.pdf failed with: java.net.MalformedURLException
This is due to FileResponse and File not working well together. The same is
true for files that disappear from the crawled file system after a while
(i.e. they are reported as an error instead of GONE). I am too new to Nutch to
know the design rationale behind this or any side effects. Below is a patch that
I have used; it cleans up the segment data and removes the false errors from
the log file.
--- src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java (revision 997976)
+++ src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java (working copy)
@@ -79,6 +79,10 @@
       if (code == 200) { // got a good response
         return new ProtocolOutput(response.toContent()); // return it
+      } else if (code == 404) { // handle no such file
+        return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_GONE );
+      } else if (code == 304) { // handle not modified
+        return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_NOTMODIFIED );
       } else if (code >= 300 && code < 400) { // handle redirect
         if (redirects == MAX_REDIRECTS)
           throw new FileException("Too many redirects: " + url);
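
To make the effect of the branch order easier to see, here is a small
standalone sketch. It does not use any Nutch classes: FileStatusSketch, Outcome
and classify are hypothetical names made up for illustration, and the
"unpatched" path is only my assumption about the current behaviour (304 falls
into the generic 3xx redirect branch, 404 falls through to the error case).

```java
// Illustrative only: mirrors the branch order in the patch, not Nutch's real classes.
final class FileStatusSketch {

    /** Hypothetical stand-ins for the outcomes the fetcher can record. */
    enum Outcome { SUCCESS, GONE, NOT_MODIFIED, REDIRECT, ERROR }

    /**
     * Maps a FileResponse-style status code to an outcome.
     * With patched == false, 304 and 404 get no special handling
     * (assumed pre-patch behaviour).
     */
    static Outcome classify(int code, boolean patched) {
        if (code == 200) {
            return Outcome.SUCCESS;          // got a good response
        }
        if (patched && code == 404) {
            return Outcome.GONE;             // no such file
        }
        if (patched && code == 304) {
            return Outcome.NOT_MODIFIED;     // must be tested before the 3xx range
        }
        if (code >= 300 && code < 400) {
            return Outcome.REDIRECT;         // generic redirect handling
        }
        return Outcome.ERROR;                // everything else is reported as an error
    }

    public static void main(String[] args) {
        // Print the outcome for each code with and without the extra branches.
        for (int code : new int[] {200, 304, 404}) {
            System.out.println(code + ": before=" + classify(code, false)
                    + ", after=" + classify(code, true));
        }
    }
}
```

The point of the sketch is that the 304 check has to come before the
`code >= 300 && code < 400` test, otherwise an unmodified file is still treated
as a redirect.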