Good Morning Kauu,

As far as I have seen, Nutch only handles UTF-8 character data, so simplified
Chinese that is encoded as UTF-8 should come out OK. If the crawl encounters
Chinese text in a non-UTF-8 encoding, the web site may be serving it under an
older ISO standard, or you may not have the language pack installed to
properly support Chinese.
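
To illustrate the kind of garbling an encoding mismatch produces, here is a
minimal Python sketch. It is not Nutch code. The file name is made up, and GBK
is only an assumed example of a non-UTF-8 encoding a server might use; I do
not know what your FTP server actually sends.

```python
# Hypothetical file name; GBK is an assumed server-side encoding,
# chosen only as an example of a non-UTF-8 Chinese charset.
name = "中文文件.txt"

raw_gbk = name.encode("gbk")      # bytes as a GBK-configured server might send them
raw_utf8 = name.encode("utf-8")   # bytes as a UTF-8 server would send them


def decode_as(raw: bytes, charset: str) -> str:
    """Decode bytes, substituting U+FFFD for any invalid sequence."""
    return raw.decode(charset, errors="replace")


# Reading GBK bytes with the wrong charset gives garbled text.
as_latin1 = decode_as(raw_gbk, "iso-8859-1")
as_utf8 = decode_as(raw_gbk, "utf-8")

# Reading them with the charset they were written in recovers the name.
correct = decode_as(raw_gbk, "gbk")

print(repr(as_latin1))
print(repr(as_utf8))
print(correct)
```

If the names only come back correctly when decoded with something other than
UTF-8, the mismatch is most likely between the server's encoding and what the
crawler assumes, not in the search page itself.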

Personally, I would download the language pack for your operating system and
see what happens.

r/d

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED] 
Sent: Sunday, April 02, 2006 7:48 AM
To: [email protected]
Subject: hi all

hi all:
   I have a big problem when crawling an FTP site.
   It seems that Nutch can't parse or index files with Chinese names!
The command I run looks like this:

bin/nutch crawl urls.txt -dir test.dir

(i've modified the crawl-urlfilter.txt)


# skip file:, ftp:, & mailto: urls
#-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept hosts in MY.DOMAIN.NAME
+^ftp://*


When I search for something in Tomcat 5.0.28, the results come back as
garbled characters.
Can anyone tell me anything helpful for solving this problem?
Any reply will be appreciated.

--
www.babatu.com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general