recrawls file protocol causes Errors/Exceptions when actually not modified or 
gone
----------------------------------------------------------------------------------

                 Key: NUTCH-911
                 URL: https://issues.apache.org/jira/browse/NUTCH-911
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.1
            Reporter: Peter Lundberg
            Priority: Minor


When recrawling a file system, files are marked as errors and log output such 
as the following appears:

java.net.MalformedURLException
        at java.net.URL.<init>(URL.java:601)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:85)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:627)
fetch of file:/Users/peter.lundberg/Documents/valtech/scan-test/Peter Lundberg 20090929.pdf failed with: java.net.MalformedURLException

This is due to FileResponse and File not working well together. The same is 
true for files that disappear from the file system being crawled after a while 
(i.e. they are reported as an error instead of GONE). I am too new to Nutch to 
know the design rationale behind this or any side effects. Below is a patch 
that I have used; it cleans up the segment data and removes false errors from 
the log file.

--- src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java  (revision 997976)
+++ src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java  (working copy)
@@ -79,6 +79,10 @@
         if (code == 200) {                          // got a good response
           return new ProtocolOutput(response.toContent());              // return it
   
+        } else if (code == 404) {                   // handle no such file
+          return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_GONE );
+        } else if (code == 304) {                   // handle not modified
+          return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_NOTMODIFIED );
         } else if (code >= 300 && code < 400) {     // handle redirect
           if (redirects == MAX_REDIRECTS)
             throw new FileException("Too many redirects: " + url);
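
The order of the branches in the patch matters: 304 also satisfies the generic 
"code >= 300 && code < 400" redirect test, so the not-modified check has to 
come first. The following is only a minimal, self-contained sketch of that 
branch order. The StatusMapping class, the Outcome enum and the classify method 
are hypothetical names made up for illustration and are not part of Nutch, and 
the final ERROR branch is an assumption about what happens to all other codes.

```java
// Illustrative sketch only: StatusMapping, Outcome and classify are
// hypothetical names, not Nutch classes or ProtocolStatus constants.
public class StatusMapping {

    // Stand-ins for the outcomes the patched branch chain distinguishes.
    enum Outcome { SUCCESS, GONE, NOT_MODIFIED, REDIRECT, ERROR }

    // Mirrors the order of the if/else chain in the patch.
    static Outcome classify(int code) {
        if (code == 200) {                          // got a good response
            return Outcome.SUCCESS;
        } else if (code == 404) {                   // no such file
            return Outcome.GONE;
        } else if (code == 304) {                   // not modified; must precede the range test
            return Outcome.NOT_MODIFIED;
        } else if (code >= 300 && code < 400) {     // any other 3xx is treated as a redirect
            return Outcome.REDIRECT;
        } else {                                    // assumption: everything else is an error
            return Outcome.ERROR;
        }
    }

    public static void main(String[] args) {
        // Print the outcome for a few representative codes.
        for (int code : new int[] {200, 304, 404, 301, 500}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

If the 304 test were placed after the range test, a not-modified response 
would fall into the redirect branch instead.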


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
