2006/10/18, Frederic Goudal <[EMAIL PROTECTED]>:

Hello,

I'm beginning to play with Nutch to index our own web site.
I have done a first crawl and I have tried the recrawl script.
While fetching, I get lines like these:

fetching http://www.yourdictionary.com/grammars.html
fetching http://www.cours.polymtl.ca/if540/hiv_00.htm
fetching http://www.maxim-ic.com/quick_view2.cfm/qv_pk/</font></a>

but my crawl-urlfilter.txt is:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*enseirb.fr/
+^http://www.enseirb.fr/

# skip everything else
-.

So... I think I am missing something.

Frederic, what exactly is the problem? Would you like the recrawl not to
leave your web site? You can do that very easily: set the
"db.ignore.external.links" property in nutch-site.xml to "true" (you
can copy the XML property from nutch-default.xml and then change the
value to "true").
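
As a rough sketch, the override in nutch-site.xml might look like the
fragment below. The property name and value are the ones mentioned above;
the description text is my own paraphrase, not copied from
nutch-default.xml, and I am assuming the fragment goes inside the root
<configuration> element that nutch-site.xml already contains.

```xml
<!-- Sketch only: place inside the existing <configuration> element
     of conf/nutch-site.xml. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <!-- Paraphrased description (assumption, not the official wording):
       outlinks that point to hosts other than the page's own host
       are ignored, so the crawl stays on the injected hosts. -->
  <description>Ignore outlinks that lead to external hosts.</description>
</property>
```

With this in place the crawl should stay on the hosts you injected, without
relying only on the URL filter rules.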

Btw, as a beginner, totally ignorant of Java, and a system engineer with no
time, in charge of too many things: is there any doc that really explains
the behaviour of Nutch?

A good place to read about Nutch is the Nutch wiki:
http://wiki.apache.org/nutch/

Cheers,
t.n.a.
