[ 
https://issues.apache.org/jira/browse/NUTCH-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2459:
-----------------------------------
    Fix Version/s: 1.15

> Nutch cannot download/parse some files via FTP
> ----------------------------------------------
>
>                 Key: NUTCH-2459
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2459
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.13
>         Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>            Reporter: Hiran Chaudhuri
>             Fix For: 1.15
>
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not 
> supported by Nutch, I turned on the FTP service on the NAS and configured 
> Nutch to crawl ftp://nas.
> The results vary in a way that seems to point to problems within Nutch, 
> although this may need further evaluation.
> As some files could not be downloaded and I could not see a useful error 
> message, I changed the method 
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not 
> only return the protocol status but also send the full exception and stack 
> trace to the logs:
> {{
> } catch (Exception e) {
>   LOG.warn("Could not get {}", url, e);
>   return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> }}
> With this modification I now see messages like these in the log file:
> {{2017-11-09 23:44:56,135 WARN  org.apache.nutch.protocol.ftp.Ftp - Error: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>         at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
>         at java.util.LinkedList.get(LinkedList.java:476)
>         at org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:267)
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> 2017-11-09 23:44:56,135 ERROR org.apache.nutch.protocol.ftp.Ftp - Could not get protocol output for ftp://nas/MediaPC/boot/memtest86+.elf
> org.apache.nutch.protocol.ftp.FtpException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>         at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:309)
>         at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>         at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
>         at java.util.LinkedList.get(LinkedList.java:476)
>         at org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
> }}
> I cannot tell what the URLs showing this problem have in common. They seem 
> to be regular files, yet many other regular files are fetched and parsed 
> successfully. As far as I understand the source code, at least one 
> outgoing link is expected:
> {{
> FTPFile ftpFile = (FTPFile) list.get(0);
> }}
> Can this be safely assumed for all files? Or should there rather be a check 
> whether any outgoing links were found?
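
The check asked about above could look roughly like the following. This is a minimal, self-contained sketch and not the actual Nutch fix: the class FtpListingGuard, the nested Entry type (a stand-in for the FTPFile in the snippet above), and the method firstEntry are all hypothetical names chosen for illustration. It only shows how an empty listing can be detected before calling list.get(0), so that the caller can report a clear error instead of hitting an IndexOutOfBoundsException.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration; none of these names exist in Nutch.
class FtpListingGuard {

    /** Hypothetical stand-in for the FTPFile entries in the listing. */
    static final class Entry {
        final String name;
        final long size;

        Entry(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    /**
     * Returns the first listing entry, or an empty Optional when the
     * listing is null or contains no entries, instead of letting
     * list.get(0) throw IndexOutOfBoundsException.
     */
    static Optional<Entry> firstEntry(List<Entry> list) {
        if (list == null || list.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(list.get(0));
    }

    public static void main(String[] args) {
        // An empty listing, as in the failing case from the stack trace.
        List<Entry> empty = new LinkedList<>();
        System.out.println(firstEntry(empty).isPresent()); // prints "false"

        // A listing with one entry behaves as before.
        List<Entry> one = new LinkedList<>();
        one.add(new Entry("memtest86+.elf", 0L));
        System.out.println(
            firstEntry(one).map(e -> e.name).orElse("<none>")); // prints "memtest86+.elf"
    }
}
```

Whether an empty listing should then be treated as "not found" or as a protocol error is a separate decision that this sketch does not settle.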



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
