Good morning Kauu,

As far as I have noticed, Nutch only knows about UTF-8 character codes, so Simplified Chinese that is served as UTF-8 should come out OK. If the crawl encounters Chinese text in a non-UTF-8 encoding, the site may be serving it under an older ISO standard, or you may not have the language pack installed to properly support Chinese.
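
To illustrate the kind of mismatch I mean, here is a minimal Python sketch. It is not Nutch code, and the file name and the choice of GBK as the "older" encoding are assumptions made purely for the example. It shows that Chinese bytes written in a legacy encoding turn into replacement characters when they are read as UTF-8, while the same text round-trips cleanly when the declared encoding matches the real one.

```python
# Illustrative sketch only; this is not how Nutch itself decodes content.
# The file name and the GBK encoding are assumptions for the example.
text = "中文文件.txt"

# The same name as it might arrive from a legacy server vs. a UTF-8 one.
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")


def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


garbled = decode_as_utf8(gbk_bytes)    # wrong charset assumed: messy output
clean = decode_as_utf8(utf8_bytes)     # charset matches: text is intact
recovered = gbk_bytes.decode("gbk")    # decoding with the real charset works

print(repr(garbled))
print(clean)
print(recovered)
```

If the garbled output resembles what you see in the search results, the bytes are probably being read with the wrong charset somewhere between the FTP server and the search page.
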
Personally, I would download the language pack for your operating system and see what happens.

r/d

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 02, 2006 7:48 AM
To: [email protected]
Subject: hi all

Hi all,

I have a big problem when crawling an FTP site. It seems that Nutch can't parse or index files with Chinese names.

The command I run looks like this:

    bin/nutch crawl urls.txt -dir test.dir

(I've modified crawl-urlfilter.txt:)

    # skip file:, ftp:, & mailto: urls
    #-^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

    # skip URLs containing certain characters as probable queries, etc.
    [EMAIL PROTECTED]

    # accept hosts in MY.DOMAIN.NAME
    +^ftp://*

When I search for something in Tomcat 5.0.28, the results come back as garbled characters.

Can anyone tell me anything that would help solve this problem? Any reply will be appreciated.

--
www.babatu.com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
