[ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453523 ]

Andrzej Bialecki  commented on NUTCH-407:
-----------------------------------------

As far as I understand it, both the original issue that you refer to and your 
own issue come from misconfigured URLFilters. I don't see why this fix is 
needed if the filters are configured properly.

First, let's establish the names for directions - normally "up" refers to a 
parent directory, and "down" refers to a child directory.

The current behavior is to collect ANY URLs that we find pointing out from the 
current URL, unless they are prohibited by filters. When crawling a local FS, 
this includes URLs of parent dirs: unless your URLFilters prohibit collecting 
them, they are collected too. That's why it behaved the way it did. This 
behavior is consistent with HTTP and FTP crawling.

So, instead of your "special case" fix, you should simply put the root directory 
in your URLFilters configuration. E.g. for urlfilter-regex, you should put the 
following in regex-urlfilter.txt:

+^file:///c:/top/directory/
-.
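
To illustrate why these two rules keep the crawl below the root directory, here 
is a minimal Python sketch. It is not Nutch code: it only mimics the rule 
semantics, assuming that rules are tried in order, that the first pattern found 
anywhere in the URL decides ('+' accepts, '-' rejects), and that a URL matching 
no rule is rejected. The names parse_rules and accepts are hypothetical.

```python
import re

# The two rules from above, in regex-urlfilter.txt syntax.
RULES = """\
+^file:///c:/top/directory/
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule (assumed semantics)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is rejected


rules = parse_rules(RULES)
for url in (
    "file:///c:/top/directory/docs/readme.txt",  # child of the root dir
    "file:///c:/top/",                           # parent of the root dir
):
    print(url, accepts(rules, url))
```

With this configuration the parent directory never matches the '+' rule, so it 
falls through to the catch-all '-.' rule and is dropped, while everything under 
the root directory is kept.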

> Make Nutch crawling parent directories for file protocol configurable
> ---------------------------------------------------------------------
>
>                 Key: NUTCH-407
>                 URL: http://issues.apache.org/jira/browse/NUTCH-407
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Thorsten Scherler
>         Attachments: 407.fix.diff
>
>
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html
> I am looking into fixing some very weird behavior of the file protocol.
> I am using 0.8.
> Researching this topic I found 
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
> and
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> I am on Ubuntu but I have the same problem that nutch is going down the
> tree (including parents) and not up (including children from the root
> url).
> Further, I would vote to make fetching parents optional, controlled by a
> property that determines whether this not very intuitive "feature" is enabled.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
