[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176111#comment-14176111 ]
Sebastian Nagel commented on NUTCH-1483:
----------------------------------------

You'll get it working if
(1) the above-mentioned rule in {{regex-normalize.xml}} is removed or urlnormalizer-regex is deactivated, and
(2) seed URLs are given with only one slash after the protocol: {{file:/var/www/test.html}} (cf. [1|http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html])

(about 1) The rule that collapses repeated slashes makes sense in general. Allowing two slashes after a colon ':' is not enough, because after "file:" there may be three slashes. Maybe we can formulate the normalization rule more restrictively. The default rules should work out of the box for any protocol!

(about 2) Somehow duplicates slip into crawldb if seeds are given with multiple slashes: injecting {{file:///path/test.html}} also adds {{file:/path/test.html}}.

> Can't crawl filesystem with protocol-file plugin
> ------------------------------------------------
>
>                 Key: NUTCH-1483
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1483
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.6, 2.1
>         Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
>            Reporter: Rogério Pereira Araújo
>            Priority: Critical
>             Fix For: 1.10
>
>         Attachments: NUTCH-1483.patch
>
>
> I tried to follow the same steps described in this wiki page:
> http://wiki.apache.org/nutch/IntranetDocumentSearch
> I made all required changes on regex-urlfilter.txt and added the following
> entry in my seed file:
> file:///home/rogerio/Documents/
> The permissions are ok, I'm running nutch with the same user as folder owner,
> so nutch has all the required permissions, unfortunately I'm getting the
> following error:
> org.apache.nutch.protocol.file.FileError: File Error: 404
>         at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
>         at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
> fetch of file://home/rogerio/Documents/ failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> Why the logs are showing file://home/rogerio/Documents/ instead of
> file:///home/rogerio/Documents/ ???
> Note: The regex-urlfilter entry only works as expected if I add the entry
> +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/
> as wiki says.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
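
To illustrate point (about 1) of the comment, here is a minimal Python sketch of how a slash-collapsing rule can turn {{file:///home/rogerio/Documents/}} into the {{file://home/rogerio/Documents/}} seen in the reporter's log. The pattern {{(?<!:)/{2,}}} is an assumption about the shape of the rule in {{regex-normalize.xml}}, since the rule itself is not quoted in this thread. The names {{collapse_slashes}}, {{COLLAPSE_SLASHES}} and {{STRICTER}} are hypothetical, and the stricter pattern is only one possible reading of "more restrictive", not the attached patch. Python's {{re}} module stands in for the Java regex engine Nutch uses; the constructs used here should behave the same way in both.

```python
import re

# ASSUMPTION: approximates the slash-collapsing rule in regex-normalize.xml.
# It collapses any run of 2+ slashes unless the run starts right after ':'.
COLLAPSE_SLASHES = re.compile(r"(?<!:)/{2,}")

# HYPOTHETICAL stricter variant (not from the issue or its patch): a run is
# also left alone when it starts right after ':/', so up to three slashes
# following the protocol colon survive.
STRICTER = re.compile(r"(?<!:)(?<!:/)/{2,}")


def collapse_slashes(url, pattern=COLLAPSE_SLASHES):
    """Replace every slash run matched by `pattern` with a single slash."""
    return pattern.sub("/", url)


if __name__ == "__main__":
    seeds = (
        "file:///home/rogerio/Documents/",  # seed from the bug report
        "file:/var/www/test.html",          # single-slash form from the comment
        "http://example.com//a///b",        # ordinary duplicate slashes
    )
    for seed in seeds:
        print(seed)
        print("  assumed rule :", collapse_slashes(seed))
        print("  stricter rule:", collapse_slashes(seed, STRICTER))
```

Under the assumed rule, the first slash after "file:" is protected by the lookbehind, but the two slashes that follow are collapsed into one, which would explain the host-like "home" in the failing URL. The single-slash seed contains no run of two slashes, so neither pattern touches it.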