2006/10/18, Frederic Goudal <[EMAIL PROTECTED]>:
Hello, I'm beginning to play with Nutch to index our own web site. I have done a first crawl and I have tried the recrawl script. While fetching I see lines like these:

fetching http://www.yourdictionary.com/grammars.html
fetching http://www.cours.polymtl.ca/if540/hiv_00.htm
fetching http://www.maxim-ic.com/quick_view2.cfm/qv_pk/

but my crawl-urlfilter.txt is:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*enseirb.fr/
+^http://www.enseirb.fr/

# skip everything else
-.

So... I think I am missing some point.
Frederic, what exactly is the problem? You'd like the recrawl not to leave your web site? You can do that very easily: set the "db.ignore.external.links" property in nutch-site.xml to "true" (you can copy the XML property from nutch-default.xml and then change the value to "true").
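As a rough sketch, the override in conf/nutch-site.xml would look something like this (the description text is optional and only paraphrases what nutch-default.xml says for this property):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks that point to hosts other than the page's
  own host are ignored, so the crawl stays within your seed site(s).
  </description>
</property>

With that in place the recrawl should stop following links to yourdictionary.com, polymtl.ca and the like, regardless of what the URL filter accepts.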
Btw, as a beginner who is totally ignorant of Java, and a system engineer with no spare time and too many things to look after, is there any doc that really explains the behaviour of Nutch?
A good place to read about Nutch is the Nutch wiki: http://wiki.apache.org/nutch/

Cheers,
t.n.a.
