[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-1483.
------------------------------------
       Resolution: Fixed
    Fix Version/s:     (was: 2.4)
                   2.3

Committed including NUTCH-1879, NUTCH-1880, and NUTCH-1885 to trunk and 2.x, r1636736.
Thanks to [~ararog] for reporting this problem and to [~angela_wang] for review and testing!

> Can't crawl filesystem with protocol-file plugin
> ------------------------------------------------
>
>                 Key: NUTCH-1483
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1483
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.6, 2.1
>         Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
>            Reporter: Rogério Pereira Araújo
>            Priority: Critical
>             Fix For: 2.3, 1.10
>
>         Attachments: TestProtocolFileUrlUri.java
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner,
> so nutch has all the required permissions, unfortunately I'm getting the
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
>         at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
>         at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/
> as wiki says.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
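
The two spellings in the report, file:///home/rogerio/Documents/ and file://home/rogerio/Documents/, are not equivalent once parsed. A minimal sketch using only the JDK's java.net.URI shows the difference. The class name FileUriSlashes and its helper methods are hypothetical, chosen for illustration. This is neither the attached TestProtocolFileUrlUri.java nor the committed fix, and it makes no claim about where Nutch drops the slash.

```java
import java.net.URI;

public class FileUriSlashes {

    /** Host component of the URI, or null when the authority is empty. */
    static String hostOf(String uri) {
        return URI.create(uri).getHost();
    }

    /** Path component of the URI. */
    static String pathOf(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        String[] seeds = {
            "file:///home/rogerio/Documents/", // three slashes: empty authority
            "file://home/rogerio/Documents/"   // two slashes: "home" parsed as host
        };
        for (String seed : seeds) {
            System.out.println(seed + " -> host=" + hostOf(seed)
                    + ", path=" + pathOf(seed));
        }
    }
}
```

With three slashes the host is null and the path is /home/rogerio/Documents/. With two slashes, "home" is taken as the host and the path shrinks to /rogerio/Documents/, which would not name the reporter's directory on the local filesystem.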