[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488254#comment-13488254 ]
Sebastian Nagel commented on NUTCH-1483:
----------------------------------------

Rogério, can you apply the patch, re-compile and try again?

> Can't crawl filesystem with protocol-file plugin
> ------------------------------------------------
>
>                 Key: NUTCH-1483
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1483
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.6, 2.1
>         Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
>            Reporter: Rogério Pereira Araújo
>         Attachments: NUTCH-1483.patch
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner,
> so nutch has all the required permissions, unfortunately I'm getting the
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
>         at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
>         at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/
> as wiki says.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
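
To illustrate why the two-slash and three-slash forms quoted in the report are not interchangeable, here is a minimal sketch using only the JDK (java.net.URI and java.util.regex). It is not Nutch code and does not claim to show the cause of the bug or the content of the patch. The class name FileUrlSlashes and both helper methods are hypothetical, and the assumption that a filter rule is applied as an unanchored regex search on the URL string is mine, not taken from the report.

```java
import java.net.URI;
import java.util.regex.Pattern;

public class FileUrlSlashes {

    /** Reports how java.net.URI splits a file URL into host and path. */
    static String describe(String url) {
        URI uri = URI.create(url);
        return "host=" + uri.getHost() + " path=" + uri.getPath();
    }

    /**
     * Hypothetical stand-in for a "+regex" filter rule: true if the
     * regex is found anywhere in the URL string (assumed semantics).
     */
    static boolean ruleMatches(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String seed = "file:///home/rogerio/Documents/";   // URL from the seed file
        String logged = "file://home/rogerio/Documents/";  // URL shown in the log

        // Three slashes: empty authority, the whole location is the path.
        System.out.println(describe(seed));
        // prints: host=null path=/home/rogerio/Documents/

        // Two slashes: "home" is parsed as the host, and the path loses it.
        System.out.println(describe(logged));
        // prints: host=home path=/rogerio/Documents/

        // A rule written for the three-slash form does not match the
        // two-slash string, which is consistent with the reporter's note.
        System.out.println(ruleMatches("^file:///home/rogerio/Documents/", logged)); // prints: false
        System.out.println(ruleMatches("^file://home/rogerio/Documents/", logged));  // prints: true
    }
}
```

Under this reading, file://home/rogerio/Documents/ refers to the path /rogerio/Documents/ on a host named "home", so a 404 for that URL would not be surprising even when /home/rogerio/Documents/ exists and is readable.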