[ 
https://issues.apache.org/jira/browse/NUTCH-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiran Chaudhuri updated NUTCH-2451:
-----------------------------------
    Description: 
I tried running Nutch on my Synology NAS. As the SMB protocol is not supported 
by Nutch, I turned on the FTP service on the NAS and configured Nutch to crawl 
ftp://nas.
The results vary and seem to point to problems within Nutch, although this may 
need further evaluation.

Because some files could not be downloaded and I could not see a useful error 
message, I changed the method 
org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) so that 
it not only returns the protocol status but also sends the full exception and 
stack trace to the logs:

{{    } catch (Exception e) {
      LOG.warn("Could not get {}", url, e);
      return new ProtocolOutput(null, new ProtocolStatus(e));
    }
}}
With this modification, messages like the following appear in the log file:
{{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
2017-10-25 22:09:32,147 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
java.net.MalformedURLException
        at java.net.URL.<init>(URL.java:627)
        at java.net.URL.<init>(URL.java:490)
        at java.net.URL.<init>(URL.java:439)
        at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
Caused by: java.lang.NullPointerException
}}

Please note that the URL was not configured by me; it was discovered while 
crawling my NAS. The URL also looks perfectly fine to me. Even if the file did 
not exist, I would not expect a MalformedURLException. Moreover, Firefox 
retrieves the file successfully from the same URL with the same credentials.

How come Nutch cannot get the file?
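
As a possible aid for narrowing this down: the trace shows a 
MalformedURLException without a message whose cause is a NullPointerException. 
The JDK's java.net.URL constructor wraps any unexpected exception raised while 
parsing in a MalformedURLException, so this shape can appear even when the URL 
text itself is fine. The sketch below only demonstrates that wrapping with a 
plain JDK. The class name UrlCauseDemo and the method rootCauseOf are 
hypothetical, and passing null is just one way to provoke the behaviour; it is 
an assumption, not a finding, that anything similar happens inside Nutch.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCauseDemo {

    /**
     * Tries to construct a URL from the given spec.
     * Returns "ok" on success, otherwise the class name of the root cause.
     */
    @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs but still available
    public static String rootCauseOf(String spec) {
        try {
            new URL(spec);
            return "ok";
        } catch (MalformedURLException e) {
            // Walk the cause chain to find what actually went wrong.
            Throwable root = e;
            while (root.getCause() != null) {
                root = root.getCause();
            }
            return root.getClass().getName();
        }
    }

    public static void main(String[] args) {
        // The URL from the log, parsed by a stock JDK without any Nutch plugins.
        System.out.println(rootCauseOf("ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so"));
        // A null spec: the NullPointerException is wrapped in a MalformedURLException.
        System.out.println(rootCauseOf(null));
    }
}
```

With a stock JDK the null call reports java.lang.NullPointerException as the 
root cause, which matches the "Caused by" line in the log. Logging the root 
cause this way inside the catch block may show where the NullPointerException 
originates.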

  was:
I tried running Nutch on my Synology NAS. As SMB protocol is not contained, I 
turned on FTP service on the NAS and configured Nutch to crawl ftp://nas.
The experience gives me varying results which seem to point to problems within 
Nutch. However this may need further evaluation.

As some files could not be downloaded and I could not see a good error message 
I changed the method org.apache.nutch.protocol.ftp.FTP.getProtocolOutput(Text, 
CrawlDatum) to not only return protocol status but send the full exception and 
stack trace to the logs:

{{    } catch (Exception e) {
        LOG.warn("Could not get {}", url, e);
      return new ProtocolOutput(null, new ProtocolStatus(e));
    }
}}
With this modification I suddenly see such messages in the logfile:
{{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching 
ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
2017-10-25 22:09:32,147 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get 
ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
java.net.MalformedURLException
        at java.net.URL.<init>(URL.java:627)
        at java.net.URL.<init>(URL.java:490)
        at java.net.URL.<init>(URL.java:439)
        at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
Caused by: java.lang.NullPointerException
}}

Please mind the URL was not configured from me. Instead it was obtained by 
crawling my NAS. Also the URL looks perfectly fine to me. Even if the file did 
not exist I would not expect a MalformedURLException to occur. Even more, using 
Firefox and the same authentication data on the same URL retrieves the file 
successfully.

How come Nutch cannot get the file?


> MalformedURLExceptions on perfectly looking URLs?
> -------------------------------------------------
>
>                 Key: NUTCH-2451
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2451
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.13
>         Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> Synology RS816
>            Reporter: Hiran Chaudhuri



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
