Hm, jibjoice, I think you keep emailing the wrong list.  You should email 
[EMAIL PROTECTED], but you are emailing [EMAIL PROTECTED].  You'll get help on 
nutch-user.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: jibjoice <[EMAIL PROTECTED]>
To: hadoop-user@lucene.apache.org
Sent: Sunday, January 6, 2008 8:30:38 PM
Subject: Re: Nutch crawl problem


Why can I crawl http://game.search.com but not
http://www.search.com? My conf/crawl-urlfilter is:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
#-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*search.com/

# skip everything else
+.
 
Also, there are some hosts I can't crawl because I get the error
"Generator: 0 records selected for fetching, exiting ...". I set the same
config for all hosts. Why?
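
For anyone checking a filter like the one above, here is a minimal Python sketch of first-match rule evaluation. It assumes that rules are tried in file order, that the first pattern found in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The accepts() helper and RULES list are hypothetical names for illustration, not part of Nutch, and Python's re module is only standing in for the Java regex engine Nutch uses. The rule masked as [EMAIL PROTECTED] in the archive is left out because its text cannot be recovered.

```python
import re

# Rules copied from the filter above, in file order. Commented-out rules are
# skipped, and the rule masked as [EMAIL PROTECTED] is omitted (text unknown).
RULES = [
    ("-", r"^(file|ftp|mailto):"),      # skip file:, ftp:, & mailto: urls
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),    # skip repeated path segments (loops)
    ("+", r"."),                        # accept anything that got this far
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for url in ("http://game.search.com/",
                "http://www.search.com/",
                "ftp://www.search.com/"):
        print(url, "accepted" if accepts(url) else "rejected")
```

Under these assumptions both http://game.search.com/ and http://www.search.com/ pass the visible rules, so the difference between the two hosts may come from somewhere other than this filter file.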
-- 
View this message in context:
 http://www.nabble.com/Nutch-crawl-problem-tp14327978p14657080.html
Sent from the Hadoop Users mailing list archive at Nabble.com.



