Thanks for the advice! Now I know what's going on. But my OS is Windows XP (Chinese edition), which supports Chinese very well. I used Luke to look at the index, and there are garbled characters when I crawl Chinese web sites. So, how can I deal with it?
Any reply will be appreciated.

On 4/2/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
>
> Good Morning Kauu,
>
> I have noticed that Nutch only knows about UTF-8 character codes, so the
> simplified Chinese character set is UTF-8 and should come out ok. If the
> crawl sees Chinese in a non-UTF-8 encoding, the web site may be serving it
> under an older ISO standard, or you may not have the language pack
> installed to properly support Chinese.
>
> Personally, I would download the language pack for your operating system
> and see what happens.
>
> r/d
>
> -----Original Message-----
> From: kauu [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 02, 2006 7:48 AM
> To: [email protected]
> Subject: hi all
>
> hi all:
> I have a big problem when crawling FTP.
> It seems that Nutch can't parse or index files with Chinese names!
> The command I run looks like this:
>
> bin/nutch crawl urls.txt -dir test.dir
>
> (I've modified crawl-urlfilter.txt)
>
>
> # skip file:, ftp:, & mailto: urls
> #-^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # accept hosts in MY.DOMAIN.NAME
> +^ftp://*
>
>
> When I search for something in Tomcat 5.0.28, the results are garbled
> characters. Can anyone tell me anything helpful to solve this problem?
> Any reply will be appreciated.
>
> --
> www.babatu.com
>

--
www.babatu.com
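
One possible cause of the garbled characters discussed above is that the pages are served in a legacy Chinese encoding (GB2312/GBK) but get decoded with a different charset. The Python sketch below only illustrates that mismatch; it is not Nutch code. The helper name `decode_page` and the candidate list (`utf-8`, then `gb18030`) are assumptions made for the illustration.

```python
def decode_page(raw, candidates=("utf-8", "gb18030")):
    """Return (text, charset) for the first candidate that decodes cleanly.

    Hypothetical helper: strict UTF-8 is tried first because it rejects
    most GBK byte sequences, while gb18030 accepts almost anything.
    """
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text but mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


sample = "中文网页"                 # stand-in for the text of a crawled page
raw_gbk = sample.encode("gbk")     # bytes as a GBK-encoded site would send them

# Decoding GBK bytes as ISO-8859-1 never fails, it just yields garbage.
garbled = raw_gbk.decode("iso-8859-1")

# Decoding with a matching charset recovers the original text.
text, charset = decode_page(raw_gbk)

print(repr(garbled))
print(text, charset)
```

If the index already holds text garbled this way, re-encoding it as ISO-8859-1 and decoding it as GBK can sometimes recover it, but only when no bytes were lost along the way.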
