[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176172#comment-14176172 ]
Sebastian Nagel commented on NUTCH-1483:
----------------------------------------

But URI.toString(), UrlUtil.toASCII(String url) and toUNICODE(String url) keep (or add!) 3 slashes!

> Can't crawl filesystem with protocol-file plugin
> ------------------------------------------------
>
> Key: NUTCH-1483
> URL: https://issues.apache.org/jira/browse/NUTCH-1483
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.6, 2.1
> Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
> Reporter: Rogério Pereira Araújo
> Priority: Critical
> Fix For: 2.3, 1.10
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner,
> so nutch has all the required permissions, unfortunately I'm getting the
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
>     at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
>     at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/
> as wiki says.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
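
To illustrate the slash handling discussed above, here is a minimal sketch using only the JDK classes java.net.URI and java.net.URL. It is not Nutch code and does not call UrlUtil; the class name FileUrlSlashes and its helper methods are hypothetical and exist only for this example. It shows how the string form of a file URL differs between the two classes, and how the two-slash form from the log is parsed.

```java
import java.net.URI;

// Hypothetical demo class, not part of Nutch.
class FileUrlSlashes {

    /** Parses the string with java.net.URI and returns its string form. */
    static String viaUri(String s) throws Exception {
        return new URI(s).toString();
    }

    /** Converts the URI to a java.net.URL and returns its string form. */
    static String viaUrl(String s) throws Exception {
        return new URI(s).toURL().toString();
    }

    /** Host component as parsed by java.net.URI, or null if there is none. */
    static String hostOf(String s) throws Exception {
        return new URI(s).getHost();
    }

    /** Path component as parsed by java.net.URI. */
    static String pathOf(String s) throws Exception {
        return new URI(s).getPath();
    }

    public static void main(String[] args) throws Exception {
        String three = "file:///home/rogerio/Documents/"; // seed entry from the report
        String two = "file://home/rogerio/Documents/";    // form shown in the fetcher log

        System.out.println(viaUri(three)); // file:///home/rogerio/Documents/
        System.out.println(viaUrl(three)); // file:/home/rogerio/Documents/

        // With only two slashes, "home" is parsed as the host, not as a directory.
        System.out.println(hostOf(two));   // home
        System.out.println(pathOf(two));   // /rogerio/Documents/
    }
}
```

If this sketch matches the situation in the report, the two-slash form no longer names /home/rogerio/Documents/ on the local machine, which would be consistent with the 404 from the file protocol.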