Thanks very much! I'll try it later. When I search through my own Tomcat I see mostly good characters, with only a few messy ones. But when I search my FTP index, all of the characters are messy.

On 4/2/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
>
> Kauu,
>
> Are you using the simplified Chinese character localization package for
> Windows XP, or are you using the non-simplified UTF version? You might
> need an IME from here:
> http://www.microsoft.com/windows/ie/downloads/recommended/ime/default.mspx
>
> That may help out.
>
> Since you are using Luke to see the index, Luke may not have character
> support built in for non-UTF-8 character sets (meaning gork when you look
> at it). I went to the Luke site http://www.getopt.org/luke/ to see if they
> mention the character sets they support, but there is nothing that states
> they support any character set.
>
> When you run your search, do you see good characters, or do you see gork?
> Luke may not be able to understand the ISO character sets. (Hypothesis).
>
> r/d
>
> -----Original Message-----
> From: kauu [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 02, 2006 8:31 AM
> To: [email protected]
> Subject: Re: hi all
>
> Thanks for the advice!
> Now I know what's up.
> But my OS is WinXP (Chinese), and it supports Chinese very well. I used
> Luke to look at the index, and there are messy characters when I crawl
> Chinese web sites.
> So how can I deal with it?
>
> Any reply will be appreciated.
>
> On 4/2/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
> >
> > Good Morning Kauu,
> >
> > I have noticed that Nutch only knows about UTF-8 character codes, so the
> > simplified Chinese character set is UTF-8 and should come out ok. If the
> > crawl sees Chinese in a non-UTF-8 encoding, the web site may be serving
> > it under an older ISO standard, or you may not have the language pack
> > installed to properly support Chinese.
> >
> > Personally, I would download the language pack for your operating system
> > and see what happens.
> >
> > r/d
> >
> > -----Original Message-----
> > From: kauu [mailto:[EMAIL PROTECTED]
> > Sent: Sunday, April 02, 2006 7:48 AM
> > To: [email protected]
> > Subject: hi all
> >
> > Hi all:
> > I have a big problem when crawling FTP.
> > It seems that Nutch can't parse or index files with Chinese names!
> > I run a command that looks like this:
> >
> > bin/nutch crawl urls.txt -dir test.dir
> >
> > (I've modified crawl-urlfilter.txt:)
> >
> > # skip file:, ftp:, & mailto: urls
> > #-^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > [EMAIL PROTECTED]
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^ftp://*
> >
> > When I search for something in Tomcat 5.0.28, the results are messy
> > characters.
> > Can anyone tell me anything helpful for solving this big problem?
> > Any reply will be appreciated.
> >
> > --
> > www.babatu.com
> >
>
> --
> www.babatu.com
>

--
www.babatu.com
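
A minimal sketch of the hypothesis discussed in the thread (text stored in a non-UTF-8 encoding but read with the wrong charset), written in Python rather than Nutch's Java for brevity. It does not use Nutch or Luke at all. The sample name "中文" and the choice of GBK as the FTP server's encoding are assumptions for illustration only; the thread does not say which encoding the server actually uses.

```python
# Assumption: the FTP server sends file names as GBK bytes.
# Assumption: "中文" stands in for a real Chinese file name.
name = "中文"

# The bytes as the (assumed) GBK server would put them on the wire.
raw = name.encode("gbk")

# Reading those bytes as ISO-8859-1 never fails, because every byte
# maps to some Latin-1 character, but the result is unreadable.
as_latin1 = raw.decode("iso-8859-1")

# Reading them as UTF-8 hits invalid sequences; with errors="replace"
# each bad sequence becomes U+FFFD, the replacement character.
as_utf8 = raw.decode("utf-8", errors="replace")

# The Latin-1 misreading is lossless, so it can be undone by
# re-encoding and decoding with the charset that was really used.
repaired = as_latin1.encode("iso-8859-1").decode("gbk")

print(repr(as_latin1))
print(repr(as_utf8))
print(repaired)
```

If the messy characters in the index look like runs of accented Latin letters, a wrong-charset read of this kind is a plausible cause. Runs of U+FFFD would suggest the original bytes were already lost at decode time and cannot be repaired afterwards.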
