Hi all, 

I am looking into fixing some very weird behavior of the file protocol
in Nutch 0.8.

While researching this topic I found
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
and
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

I am on Ubuntu and have the same problem: Nutch goes down the tree
(fetching the parent directories) instead of up (fetching the children
of the root URL).

I have in urls/nutch:
file:///home/thorsten/src/BOJA/repositories/boja/

and my crawl-urlfilter.txt looks like:
-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

[EMAIL PROTECTED]

-.*(/.+?)/.*?\1/.*?\1/

# accept filepath
+^file:///home/thorsten/src/BOJA(.*)
-^file:/(.*).svn
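
To check how these rules apply, here is a minimal Java sketch. It
assumes the filter evaluates the rules top to bottom, lets the first
matching rule decide, and rejects URLs that match no rule (as the
comments in the stock crawl-urlfilter.txt describe). The class
UrlFilterSketch and its accepts() method are hypothetical helpers, not
part of Nutch, and the [EMAIL PROTECTED] line is left out because its
content is not recoverable here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the regex URL filter; not Nutch code.
public class UrlFilterSketch {

    // One filter line: leading '+' accepts, leading '-' rejects.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // The rules from the crawl-urlfilter.txt above, in file order.
    static final List<Rule> RULES = new ArrayList<>();
    static {
        RULES.add(new Rule("-^(http|ftp|mailto):"));
        RULES.add(new Rule("-\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip"
                + "|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$"));
        RULES.add(new Rule("-.*(/.+?)/.*?\\1/.*?\\1/"));
        RULES.add(new Rule("+^file:///home/thorsten/src/BOJA(.*)"));
        RULES.add(new Rule("-^file:/(.*).svn"));
    }

    // Assumption: first matching rule wins; no match means reject.
    static boolean accepts(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "file:///home/thorsten/src/BOJA/repositories/boja/",
            "file:///home/thorsten/src/BOJA/repositories/",
            "file:///home/thorsten/src/",
            "file:///home/thorsten/src/BOJA/repositories/boja/.svn/entries"
        };
        for (String url : urls) {
            System.out.println((accepts(url) ? "+ " : "- ") + url);
        }
    }
}
```

Under these assumptions the seed URL and its parent inside BOJA are
accepted and /home/thorsten/src/ is rejected. The .svn rule is never
reached, because the accept rule above it matches first, so it would
have to be moved above the accept rule to take effect.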

I patched org.apache.nutch.protocol.file.FileResponse as described on
the folge2 site and recompiled (ant clean; ant), but it is still
fetching down and not up.

Can somebody give me some hints on how to fix that?

Furthermore, I would vote for making the fetching of parents optional,
controlled by a property, so that I can decide whether I want this not
very intuitive "feature".

TIA for any feedback.

salu2

