[Nutch-dev] [jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-28 Thread Alan Tanaman (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ] 

Alan Tanaman commented on NUTCH-407:


Our team feels that this patch would have been beneficial in practical 
terms.  In the enterprise intelligence solution that we are gradually 
porting over to Nutch, the emphasis is on ease of configuration.  We try 
to avoid exposing features such as the regex filter, which, although very 
powerful for a more experienced user, may confuse a novice.  
This is because we are primarily focused on the enterprise and less on the WWW.

This is why we preconfigure the db.ignore.external.links property to true, 
and then only the urls file is used to seed the crawl.

Our ideal is to have a collection of predefined configuration settings for 
specific scenarios -- e.g. Enterprise-XML, Enterprise-Documents, 
Enterprise-Database, Internet-News etc.  We have a script that generates 
multiple crawlers, each one with different sources to be crawled.  Although 
it is possible, it isn't practical to change the filters for each one 
manually based on individual user requirements.

I realise this patch is closed, but how about another approach: 
FileResponse.java looks at db.ignore.external.links and decides, based on 
that setting, whether to go up the tree?

Obviously, this would also prevent you from crawling outlinks to the WWW 
embedded in documents, but when crawling an enterprise file system, you usually 
don't want to go all over the place anyway.  As I see it, file systems 
differ from the web in that they are inherently hierarchical, whereas the web 
is, as its name implies, non-hierarchical.  Therefore, when crawling a file 
system, going up the tree is just as much an external URI (so to speak) as a 
link to a web site.
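
To make the idea concrete, here is a minimal sketch in Java of the kind of 
check described above.  It is not code from FileResponse.java or from the 
attached patch: the class ParentLinkSketch, its methods and the example paths 
are all hypothetical, and the boolean argument merely stands in for the value 
of db.ignore.external.links.

```java
import java.net.URI;

// Hypothetical illustration only; none of this is taken from Nutch.
class ParentLinkSketch {

    // Normalized path of a URI ("." and ".." resolved), with a trailing
    // slash so that the prefix comparison respects directory boundaries.
    static String dirPath(URI uri) {
        String path = uri.normalize().getPath();
        return path.endsWith("/") ? path : path + "/";
    }

    // True when candidateUrl is a file: URL at or below seedUrl.
    // Assumes both strings are well-formed, already-escaped URLs.
    static boolean isAtOrBelow(String seedUrl, String candidateUrl) {
        URI candidate = URI.create(candidateUrl);
        if (candidate.isOpaque() || !"file".equalsIgnoreCase(candidate.getScheme())) {
            return false;
        }
        return dirPath(candidate).startsWith(dirPath(URI.create(seedUrl)));
    }

    // ignoreExternalLinks stands in for the db.ignore.external.links value.
    // When it is false, every link is followed, parents included.
    static boolean shouldFollow(boolean ignoreExternalLinks, String seedUrl, String candidateUrl) {
        return !ignoreExternalLinks || isAtOrBelow(seedUrl, candidateUrl);
    }

    public static void main(String[] args) {
        String seed = "file:///data/docs/";
        System.out.println(shouldFollow(true, seed, "file:///data/docs/reports/q1.pdf")); // true
        System.out.println(shouldFollow(true, seed, "file:///data/docs/../"));            // false
        System.out.println(shouldFollow(false, seed, "file:///data/docs/../"));           // true
    }
}
```

With the flag set, the parent directory and any http link are treated alike: 
both fall outside the seed directory and are skipped.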

*Ducks for cover*

Alan

 Make Nutch crawling parent directories for file protocol configurable
 -

 Key: NUTCH-407
 URL: http://issues.apache.org/jira/browse/NUTCH-407
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Thorsten Scherler
 Assigned To: Andrzej Bialecki 
 Attachments: 407.fix.diff


 http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html
 I am looking into fixing some very weird behavior of the file protocol.
 I am using 0.8.
 Researching this topic I found 
 http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
 and
 http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
 I am on Ubuntu but I have the same problem that nutch is going down the
 tree (including parents) and not up (including children from the root
 url).
 Further, I would vote to make fetching parents optional, controlled by
 a property, since this feature is not very intuitive.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


[Nutch-dev] [jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-27 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453523 ] 

Andrzej Bialecki  commented on NUTCH-407:
-

As far as I understand it, the original issue that you refer to (and your 
issue) both come from misconfigured URLFilters - I don't understand why this 
fix is needed if you configure them properly.

First, let's establish the names for directions - normally up refers to a 
parent directory, and down refers to a child directory.

Current behavior is to collect ANY urls that we find pointing out from the 
current URL, unless prohibited by filters. When crawling a local FS, unless 
your URLFilters prohibit collecting parent dirs, the crawler will collect 
those URLs too - that's why it behaved the way it did. This behavior is 
consistent with HTTP and FTP crawling.

So, instead of your special-case fix you should simply put the root directory 
in your URLFilters configuration. E.g. for urlfilter-regex you should put the 
following in regex-urlfilter.txt:

+^file:///c:/top/directory/
-.
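
For readers unfamiliar with how these two lines interact, the following Java 
sketch imitates the evaluation as I understand it: rules are tried in order, 
the first one whose pattern is found in the URL decides, and a URL that 
matches no rule is rejected.  This is a simplified stand-in, not the 
urlfilter-regex plugin itself; the class RegexRuleSketch and the test URLs 
are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Simplified stand-in for a regex URL filter; NOT the Nutch plugin class.
class RegexRuleSketch {

    // One rule line: a sign ('+' accept, '-' reject) followed by a regex.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RegexRuleSketch(String... lines) {
        for (String line : lines) {
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', line.substring(1)));
        }
    }

    // The first rule whose pattern is found in the URL decides the outcome;
    // a URL that matches no rule at all is rejected.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RegexRuleSketch filter = new RegexRuleSketch("+^file:///c:/top/directory/", "-.");
        System.out.println(filter.accepts("file:///c:/top/directory/sub/a.txt")); // true
        System.out.println(filter.accepts("file:///c:/top/"));                    // false
        System.out.println(filter.accepts("http://example.com/"));                // false
    }
}
```

Because the "+" rule is anchored with ^ to the root directory, everything 
below it is accepted, while the parent directory falls through to "-." and is 
dropped.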



[Nutch-dev] [jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-27 Thread Thorsten Scherler (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453530 ] 

Thorsten Scherler commented on NUTCH-407:
-

Hi Andrzej, thanks for your answer.
I made a note in the FAQ:
http://wiki.apache.org/nutch/FAQ#head-f64e7589b2f12792d6d781f3db23840a8f3a1e10
I was about to close this issue as "won't fix", but I do not have the right 
to do so. 
Can someone close it?
