Hi all, I am looking into fixing some very weird behavior of the file protocol. I am using Nutch 0.8.
While researching this topic I found
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
and
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

I am on Ubuntu, but I have the same problem: Nutch walks towards the parent directories of the root URL instead of descending into its children.

In urls/nutch I have:

    file:///home/thorsten/src/BOJA/repositories/boja/

and my crawl-urlfilter.txt looks like this:

    -^(http|ftp|mailto):
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
    [EMAIL PROTECTED]
    -.*(/.+?)/.*?\1/.*?\1/
    # accept filepath
    +^file:///home/thorsten/src/BOJA(.*)
    -^file:/(.*).svn

I patched org.apache.nutch.protocol.file.FileResponse as described on the folge2 site and recompiled (ant clean; ant), but it still fetches the parents and not the children.

Can somebody give me some hints on how to fix that?

Furthermore, I would vote to make fetching the parents optional, controlled by a property, so that I can decide whether I want this not very intuitive "feature".

TIA for any feedback.

salu2
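
P.S. For reference, here is a rough Python sketch of how I understand the filter rules above to be evaluated. It is not Nutch code. It assumes that the rules are tried in file order, that a pattern may match anywhere in the URL, that the first matching rule decides, and that a URL matching no rule is ignored. The helper name accepts and the three test URLs are made up for illustration, and the rule shown as [EMAIL PROTECTED] is left out because its original text is not recoverable.

```python
import re

# Rules copied from the crawl-urlfilter.txt above, in file order.
# The "[EMAIL PROTECTED]" line is omitted: its real content is unknown.
RULES = [
    ("-", r"^(http|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^file:///home/thorsten/src/BOJA(.*)"),
    ("-", r"^file:/(.*).svn"),
]


def accepts(url):
    """Return True if the first rule matching anywhere in url is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is ignored.
    return False


if __name__ == "__main__":
    for url in [
        "file:///home/thorsten/src/BOJA/repositories/boja/",               # root URL
        "file:///home/thorsten/src/",                                      # a parent directory
        "file:///home/thorsten/src/BOJA/repositories/boja/.svn/entries",   # svn metadata
    ]:
        print(accepts(url), url)
```

Under these assumptions the root URL is accepted and the parent directory is rejected. The .svn path is accepted as well, because the "+" rule matches before the "-" rule for .svn is ever tried, so that last rule would probably have to move above the accept line to have any effect.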