[
https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184038#comment-14184038
]
Sebastian Nagel commented on NUTCH-1483:
----------------------------------------
Not everything is OK: the URL appears in two variants (1 or 3 slashes after
{{file:}}), which causes the NPE when passing scores ("Couldn't pass score").
The two additional slashes are added by URLUtil.toASCII(), see NUTCH-1880. The
NPE is ignored (but shown as a warning). Does the file monitor.xml contain
outlinks? Note that XML, unlike HTML, has no standard link element.
You may need a special parser to extract outlinks from a specific XML
format/schema.
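To illustrate how a file: URL can end up with a different number of slashes, here is a minimal sketch that uses only the standard java.net classes. It is not Nutch code and does not reproduce URLUtil.toASCII(); the class name FileUrlSlashes and its helper methods are hypothetical and exist only for this example. It shows that java.net.URL serialises an empty authority as a single slash, while java.net.URI keeps the three-slash string it was given, and that a two-slash URL is parsed with "home" as the host.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
public class FileUrlSlashes {

    /** Counts the '/' characters that directly follow the "file:" scheme. */
    static int slashesAfterScheme(String url) {
        int start = "file:".length();
        int n = 0;
        while (start + n < url.length() && url.charAt(start + n) == '/') {
            n++;
        }
        return n;
    }

    /** Parses the string into a java.net.URL, rethrowing failures unchecked. */
    static URL parse(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String seed = "file:///home/rogerio/Documents/";   // seed from the issue
        String logged = "file://home/rogerio/Documents/";  // form seen in the log

        // java.net.URI returns the string it was parsed from (three slashes).
        System.out.println(URI.create(seed));
        // java.net.URL omits the empty authority when serialising (one slash).
        System.out.println(parse(seed));
        // With two slashes, "home" becomes the host and the path loses it.
        URL u = parse(logged);
        System.out.println("host=" + u.getHost() + " path=" + u.getPath());
    }
}
```

Any code that compares or looks up URLs by their string form will treat these variants as different keys, so they have to be normalised to a single form first.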
But again: it would be better to move questions to the user mailing list. You are
welcome to support us by testing the patches (see NUTCH-1879 and NUTCH-1880,
which hopefully fix all file: protocol related problems), cf.
[Becoming_A_Nutch_Developer|https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer].
> Can't crawl filesystem with protocol-file plugin
> ------------------------------------------------
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
> Reporter: Rogério Pereira Araújo
> Priority: Critical
> Fix For: 2.3, 1.10
>
> Attachments: TestProtocolFileUrlUri.java
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all the required changes to regex-urlfilter.txt and added the following
> entry to my seed file:
> file:///home/rogerio/Documents/
> The permissions are OK: I'm running Nutch as the user who owns the folder,
> so Nutch has all the required permissions. Unfortunately, I'm getting the
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
>     at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
>     at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why are the logs showing file://home/rogerio/Documents/ instead of
> file:///home/rogerio/Documents/?
> Note: The regex-urlfilter entry only works as expected if I add the entry
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/
> as the wiki says.