thx very much!
i'll try it later.
when i search through my own tomcat, most characters look fine and only a few
are messy. but when i search my ftp index, all the characters are messy.
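
One way to picture the symptom is an encoding mismatch. The sketch below is only an illustration under assumptions that are not confirmed anywhere in this thread: that the FTP server sends file names as GBK bytes and that something in the pipeline decodes them as ISO-8859-1. The file name and the helper `decode_as` are made up for the example.

```python
# Sketch of how a charset mismatch yields "messy characters".
# ASSUMPTIONS (not from the thread): the bytes are GBK and get decoded
# as ISO-8859-1. The file name below is a made-up example.

def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset (hypothetical helper)."""
    return raw.decode(charset)

original = "中文文件.txt"

# What the server would send on the wire under the GBK assumption.
raw_bytes = original.encode("gbk")

# Decoding with the wrong charset does not fail; it silently gives garbage.
garbled = decode_as(raw_bytes, "iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so the mistake can
# be undone by re-encoding and then decoding with the right charset.
recovered = decode_as(garbled.encode("iso-8859-1"), "gbk")

print(repr(garbled))
print(recovered)
```

If the real encodings differ from the ones assumed here, the same round trip may not be lossless, so this is a diagnostic idea rather than a fix.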

On 4/2/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
>
> Kauu,
>
> Are you using the simplified Chinese character localization package for
> Windows XP, or are you using the non-simplified UTF version? You might need
> an IME from here:
> http://www.microsoft.com/windows/ie/downloads/recommended/ime/default.mspx
>
> That may help out.
>
> Since you are using Luke to view the index, Luke may not have character
> support built in for non-UTF-8 character sets (meaning you see gork when you
> look at it). I went to the Luke site http://www.getopt.org/luke/ to see if
> they mention which character sets they support, but nothing there states
> that they support any particular character set.
>
> When you run your search, do you see good characters, or do you see gork?
> Luke may not be able to understand the ISO character sets. (Hypothesis).
>
> r/d
>
> -----Original Message-----
> From: kauu [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 02, 2006 8:31 AM
> To: [email protected]
> Subject: Re: hi all
>
> thx for advice!
> now i know what's up.
> but my OS is WinXP (Chinese), which supports Chinese very well. and i used
> Luke to view the index, and there are messy characters after crawling
> Chinese websites.
> so, how can i deal with it??
>
> any reply will be appreciated.
>
> On 4/2/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
> >
> > Good Morning Kauu,
> >
> > I have noticed that Nutch only knows about UTF-8 character codes, so the
> > simplified Chinese character set is UTF-8 and should come out ok. If the
> > crawl sees Chinese in a non-UTF-8 encoding, the web site may be serving
> > them under an older ISO standard, or you may not have the language pack
> > installed to properly support Chinese.
> >
> > Personally, I would download the language pack for your operating system
> > and see what happens.
> >
> > r/d
> >
> > -----Original Message-----
> > From: kauu [mailto:[EMAIL PROTECTED]
> > Sent: Sunday, April 02, 2006 7:48 AM
> > To: [email protected]
> > Subject: hi all
> >
> > hi all:
> >    i have a big problem when crawling ftp.
> >   it seems that Nutch can't parse or index files named in Chinese!
> > the command i run looks like this:
> >
> > bin/nutch crawl urls.txt -dir test.dir
> >
> > (i've modified the crawl-urlfilter.txt)
> >
> >
> > # skip file:, ftp:, & mailto: urls
> > #-^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > [EMAIL PROTECTED]
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^ftp://*
> >
> >
> > when i search for something in tomcat 5.0.28, the results are messy
> > characters.
> > so can anyone tell me anything helpful to solve this big problem?
> > any reply will be appreciated.
> >
> > --
> > www.babatu.com
> >
> >
>
>
> --
> www.babatu.com
>
>


--
www.babatu.com
