[
https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241893#comment-16241893
]
Sebastian Nagel edited comment on NUTCH-2451 at 11/7/17 12:01 PM:
------------------------------------------------------------------
Ok, after a look at the code (Ftp.java): the exception is thrown during
redirect handling. I didn't check the FTP spec, but in HTTP, redirects may be
absolute or relative. For the latter case it should be: {{u = new URL(u,
response.getHeader("Location"));}} (within a try block to catch and log the
exception with URL and redirect location).
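
A minimal, self-contained sketch of that suggestion follows. It is not the actual Ftp.java code: the class name {{RedirectResolver}}, the method {{resolveRedirect}}, and the example redirect targets are hypothetical, and the redirect location is passed in as a plain string rather than read from a response object. The explicit null check rests on the assumption that the redirect location may be missing.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch: resolves a redirect location
// against the URL that was being fetched.
class RedirectResolver {

    /**
     * Returns the redirect target as a string, or null if it cannot be
     * resolved. A relative location is resolved against currentUrl; an
     * absolute location replaces it.
     */
    static String resolveRedirect(String currentUrl, String location) {
        // Assumption: a redirect may arrive without any location at all.
        if (location == null) {
            System.err.println("Redirect from " + currentUrl + " has no location");
            return null;
        }
        try {
            URL u = new URL(currentUrl);
            // Two-argument constructor: resolves location relative to u.
            u = new URL(u, location);
            return u.toString();
        } catch (MalformedURLException e) {
            // Log both the URL and the redirect location, as suggested.
            System.err.println("Could not resolve redirect from " + currentUrl
                + " to " + location + ": " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so";
        // Relative redirect (hypothetical target).
        System.out.println(resolveRedirect(base, "other.so"));
        // Absolute redirect (hypothetical target).
        System.out.println(resolveRedirect(base, "ftp://nas/pub/file.txt"));
        // Missing location: logged, returns null instead of throwing.
        System.out.println(resolveRedirect(base, null));
    }
}
```

Returning null here only keeps the sketch short; a real fix would more likely turn the failure into a protocol status that names both the URL and the redirect location.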
> MalformedURLExceptions on perfectly looking URLs?
> -------------------------------------------------
>
> Key: NUTCH-2451
> URL: https://issues.apache.org/jira/browse/NUTCH-2451
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.13
> Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
> Reporter: Hiran Chaudhuri
>
> I tried running Nutch on my Synology NAS. As the SMB protocol is not
> supported by Nutch, I turned on the FTP service on the NAS and configured
> Nutch to crawl ftp://nas.
> The results vary, which seems to point to problems within Nutch. However,
> this may need further evaluation.
> As some files could not be downloaded and I could not see a good error
> message, I changed the method
> org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not
> only return the protocol status but also send the full exception and stack
> trace to the logs:
> {{ } catch (Exception e) {
> LOG.warn("Could not get {}", url, e);
> return new ProtocolOutput(null, new ProtocolStatus(e));
> }
> }}
> With this modification I suddenly see such messages in the logfile:
> {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching
> ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> 2017-10-25 22:09:32,147 WARN org.apache.nutch.protocol.ftp.Ftp - Could not
> get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
> java.net.MalformedURLException
> at java.net.URL.<init>(URL.java:627)
> at java.net.URL.<init>(URL.java:490)
> at java.net.URL.<init>(URL.java:439)
> at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
> Caused by: java.lang.NullPointerException
> }}
> Please note that the URL was not configured by me. Instead, it was obtained
> by crawling my NAS. The URL also looks perfectly fine to me. Even if the file
> did not exist, I would not expect a MalformedURLException to occur. Moreover,
> using Firefox with the same credentials on the same URL retrieves the file
> successfully.
> How come Nutch cannot get the file?