[
https://issues.apache.org/jira/browse/NUTCH-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hiran Chaudhuri updated NUTCH-2452:
-----------------------------------
Description:
I tried running Nutch on my Synology NAS. As Nutch does not support the SMB
protocol, I turned on the FTP service on the NAS and configured Nutch to crawl
ftp://nas.
The results vary in a way that seems to point to problems within Nutch,
although this may need further evaluation.
Some files could not be downloaded and I could not see a useful error message,
so I changed the method
org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not
only return the protocol status but also send the full exception and stack
trace to the logs:
{{ } catch (Exception e) {
    LOG.warn("Could not get {}", url, e);
    return new ProtocolOutput(null, new ProtocolStatus(e));
  }
}}
With this modification, messages like the following appear in the log file:
{{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
2017-10-25 14:14:37,512 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404
    at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
}}
Please note that I did not configure this URL myself; it was discovered while
crawling my NAS. The URL also looks perfectly fine to me. Moreover, opening the
same URL in Firefox with the same credentials displays the directory
successfully. I therefore suspect that the FTP client does not decode the URL
into a form that the FTP server understands.
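To illustrate the suspicion, here is a minimal sketch of the difference between
the raw and the percent-decoded path of the failing URL. It assumes that the
FTP server expects the decoded form (with a literal space instead of %20). The
class and method names are hypothetical and are not part of Nutch; only
java.net.URI from the JDK is used.

```java
import java.net.URI;

public class FtpPathDecodingSketch {

    // Path exactly as written in the URL, percent-escapes left intact.
    public static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    // Path with percent-escapes decoded, e.g. "%20" becomes a space.
    // Assumption: this is the form an FTP server expects in CWD/LIST/RETR.
    public static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String url =
            "ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/";
        System.out.println("raw:     " + rawPath(url));
        System.out.println("decoded: " + decodedPath(url));
    }
}
```

If the FTP client sends the raw form, the server would look for a directory
literally named "Kenya%20Pics", which would be consistent with the 404 in the
log above. Whether Nutch actually does this still needs to be verified.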
was:
I tried running Nutch on my Synology NAS. As SMB protocol is not contained, I
turned on FTP service on the NAS and configured Nutch to crawl ftp://nas.
The experience gives me varying results which seem to point to problems within
Nutch. However this may need further evaluation.
As some files could not be downloaded and I could not see a good error message
I changed the method org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text,
CrawlDatum) to not only return protocol status but send the full exception and
stack trace to the logs:
{{ } catch (Exception e) {
LOG.warn("Could not get {}", url, e);
return new ProtocolOutput(null, new ProtocolStatus(e));
}
}}
With this modification I suddenly see such messages in the logfile:
{{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching
ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
2017-10-25 14:14:37,512 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get
ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404
at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
}}
Please mind the URL was not configured from me. Instead it was obtained by
crawling my NAS. Also the URL looks perfectly fine to me. Even more, using
Firefox and the same authentication data on the same URL displays the directory
successfully. Therefore I suspect the FTP client is unable to decode the URL
such that the FTP server would understand it.
> Problem retrieving encoded URLs via FTP?
> ----------------------------------------
>
> Key: NUTCH-2452
> URL: https://issues.apache.org/jira/browse/NUTCH-2452
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> Synology RS816
> Reporter: Hiran Chaudhuri