[ https://issues.apache.org/jira/browse/NUTCH-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-911:
---------------------------------------

    Fix Version/s: 1.7

> recrawls file protocol causes Errors/Exceptions when actually not modified or gone
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-911
>                 URL: https://issues.apache.org/jira/browse/NUTCH-911
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.1
>            Reporter: Peter Lundberg
>            Priority: Minor
>             Fix For: 1.7
>
>
> When recrawling file systems, files are marked as errors and log output
> such as the following appears:
> java.net.MalformedURLException
>         at java.net.URL.<init>(URL.java:601)
>         at java.net.URL.<init>(URL.java:464)
>         at java.net.URL.<init>(URL.java:413)
>         at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:85)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:627)
> fetch of file:/Users/peter.lundberg/Documents/valtech/scan-test/Peter Lundberg 20090929.pdf failed with: java.net.MalformedURLException
> This is due to FileResponse and File not working well together. The same
> is true for files that disappear from the crawled file system after a
> while (i.e. they are reported as an error instead of GONE). I am too new
> to Nutch to know the design rationale behind this or any side effects.
> Below is a patch that I have used; it cleans up the segment data and
> removes the false errors from the log file.
> --- src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java	(revision 997976)
> +++ src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java	(working copy)
> @@ -79,6 +79,10 @@
>          if (code == 200) {                          // got a good response
>            return new ProtocolOutput(response.toContent());   // return it
>
> +        } else if (code == 404) {                   // handle no such file
> +          return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_GONE );
> +        } else if (code == 304) {                   // handle not modified
> +          return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_NOTMODIFIED );
>          } else if (code >= 300 && code < 400) {     // handle redirect
>            if (redirects == MAX_REDIRECTS)
>              throw new FileException("Too many redirects: " + url);
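
The branch order in the patch matters: 304 lies inside the 300-399 range, so it has to be tested before the generic redirect branch or it would still be handled as a redirect. The sketch below illustrates only that ordering. `FileStatusSketch`, `Outcome`, and `classify` are hypothetical names rather than Nutch API, and treating every other code as an error is an assumption, since the patch context does not show the final branch.

```java
// Hypothetical stand-in for the status dispatch touched by the patch.
// None of these names exist in Nutch; they only illustrate branch ordering.
final class FileStatusSketch {

    enum Outcome { SUCCESS, GONE, NOT_MODIFIED, REDIRECT, ERROR }

    static Outcome classify(int code) {
        if (code == 200) {                        // got a good response
            return Outcome.SUCCESS;
        } else if (code == 404) {                 // no such file
            return Outcome.GONE;
        } else if (code == 304) {                 // not modified: must precede the 3xx range
            return Outcome.NOT_MODIFIED;
        } else if (code >= 300 && code < 400) {   // any other 3xx is a redirect
            return Outcome.REDIRECT;
        } else {                                  // assumption: everything else is an error
            return Outcome.ERROR;
        }
    }

    public static void main(String[] args) {
        // Print the outcome chosen for a few representative codes.
        for (int code : new int[] {200, 304, 301, 404, 500}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

If the 304 test were placed after the range check, `classify(304)` would return `REDIRECT`, which corresponds to the pre-patch behaviour the report describes.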