Hiran Chaudhuri created NUTCH-2459:
--------------------------------------

             Summary: Nutch cannot download/parse some files
                 Key: NUTCH-2459
                 URL: https://issues.apache.org/jira/browse/NUTCH-2459
             Project: Nutch
          Issue Type: Bug
          Components: protocol
    Affects Versions: 1.13
         Environment: I tried running Nutch on my Synology NAS. As Nutch does 
not support the SMB protocol, I turned on the FTP service on the NAS and 
configured Nutch to crawl ftp://nas.
The results vary, which seems to point to problems within Nutch. However, this 
may need further evaluation.

As some files could not be downloaded and I could not see a good error message, 
I changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, 
CrawlDatum) so that it not only returns the protocol status but also sends the 
full exception and stack trace to the logs:

{{
} catch (Exception e) {
  LOG.warn("Could not get {}", url, e);
  return new ProtocolOutput(null, new ProtocolStatus(e));
}
}}
With this modification, messages like the following appear in the log file:
{{
2017-11-09 23:44:56,135 WARN  org.apache.nutch.protocol.ftp.Ftp - Error: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
        at java.util.LinkedList.get(LinkedList.java:476)
        at org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
        at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:267)
        at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
2017-11-09 23:44:56,135 ERROR org.apache.nutch.protocol.ftp.Ftp - Could not get protocol output for ftp://nas/MediaPC/boot/memtest86+.elf
org.apache.nutch.protocol.ftp.FtpException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:309)
        at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
        at java.util.LinkedList.get(LinkedList.java:476)
        at org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
}}

I cannot tell what the URLs showing this problem have in common. They seem to 
be regular files, yet many other regular files are fetched and parsed 
successfully. As far as I understand the source code of 
FtpResponse.getFileAsHttpResponse, at least one outgoing link is expected:
{{
FTPFile ftpFile = (FTPFile) list.get(0);
}}

Can this be safely assumed for all files? Or should there be a check whether 
any outgoing links were found?
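
To illustrate the kind of check I have in mind, here is a minimal sketch. It 
is not the actual Nutch code: the names ListingGuard, Entry and 
firstEntryOrNull are hypothetical, and Entry only stands in for the FTPFile 
type from the snippet above. The point is that an empty list is detected 
before get(0) is called, so the caller can report a meaningful status instead 
of failing with an IndexOutOfBoundsException.

```java
import java.util.LinkedList;
import java.util.List;

public class ListingGuard {

    /** Hypothetical stand-in for a listing entry; not the real FTPFile class. */
    static final class Entry {
        final String name;

        Entry(String name) {
            this.name = name;
        }
    }

    /**
     * Returns the first entry of the list, or null when the list is missing
     * or empty, instead of letting list.get(0) throw.
     */
    static Entry firstEntryOrNull(List<Entry> list) {
        if (list == null || list.isEmpty()) {
            return null;
        }
        return list.get(0);
    }

    public static void main(String[] args) {
        List<Entry> empty = new LinkedList<>();
        List<Entry> one = new LinkedList<>();
        one.add(new Entry("memtest86+.elf"));

        // An empty list yields null rather than an exception.
        System.out.println(firstEntryOrNull(empty) == null);
        // A non-empty list yields its first entry as before.
        System.out.println(firstEntryOrNull(one).name);
    }
}
```

A caller would then have to decide what a null result means, for example 
treating the file as not found. Whether that is the right behaviour for Nutch 
is exactly the open question of this issue.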
            Reporter: Hiran Chaudhuri






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
