Kauu,

Are you using the Simplified Chinese localization package for
Windows XP, or are you using the non-simplified UTF version? You might need an
IME from here:
http://www.microsoft.com/windows/ie/downloads/recommended/ime/default.mspx 

That may help out. 

Since you are using Luke to view the index, Luke may not have character
support built in for non-UTF-8 character sets (meaning you see garbled text
when you look at it). I went to the Luke site http://www.getopt.org/luke/ to
see whether they mention which character sets they support, but nothing there
states which character sets are supported.

When you run your search, do you see good characters, or do you see garbled
ones? My hypothesis is that Luke may not be able to understand the ISO
character sets.
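
To make the hypothesis concrete, here is a minimal sketch of how bytes in a
legacy Chinese encoding turn into garbled text when a tool decodes them with
the wrong character set. It does not use Nutch or Luke at all; the sample
string and the choice of GB2312 as the legacy encoding are assumptions made
purely for illustration.

```python
# Hypothetical sample text; it is not taken from the original crawl.
TEXT = "中文网页"


def decode(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset, substituting U+FFFD for invalid input."""
    return raw.decode(charset, errors="replace")


# Bytes as a site using the legacy GB2312 encoding might serve them (assumption).
raw_gb = TEXT.encode("gb2312")

right = decode(raw_gb, "gb2312")          # matching charset: the text round-trips
as_latin1 = decode(raw_gb, "iso-8859-1")  # wrong charset: each byte becomes one Latin-1 character
as_utf8 = decode(raw_gb, "utf-8")         # wrong charset: invalid sequences become U+FFFD

print(right == TEXT)        # the correctly decoded text equals the original
print(as_latin1 == TEXT)    # the Latin-1 reading does not
print("\ufffd" in as_utf8)  # the UTF-8 reading contains replacement characters
```

If the index shows the same kind of output as the two wrong decodings, the
bytes were probably read with a charset other than the one the site used.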

r/d

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED] 
Sent: Sunday, April 02, 2006 8:31 AM
To: [email protected]
Subject: Re: hi all

Thanks for the advice!
Now I know what's up.
But my OS is WinXP (Chinese), which supports Chinese very well. I used
Luke to look at the index, and there are garbled characters when I crawl
Chinese websites.
So how can I deal with it?

Any reply will be appreciated.

On 4/2/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
>
> Good Morning Kauu,
>
> I have noticed that Nutch only knows about UTF-8 character codes, so
> Simplified Chinese encoded as UTF-8 should come out OK. If the crawl sees
> Chinese in a non-UTF-8 encoding, the web site may be serving it under an
> older ISO standard, or you may not have the language pack installed to
> properly support Chinese.
>
> Personally, I would download the language pack for your operating system
> and see what happens.
>
> r/d
>
> -----Original Message-----
> From: kauu [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 02, 2006 7:48 AM
> To: [email protected]
> Subject: hi all
>
> Hi all,
> I have a big problem when crawling an FTP site.
> It seems that Nutch can't parse or index files with Chinese names.
> The command I run looks like this:
>
> bin/nutch crawl urls.txt -dir test.dir
>
> (I've modified crawl-urlfilter.txt as follows.)
>
>
> # skip file:, ftp:, & mailto: urls
> #-^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # accept hosts in MY.DOMAIN.NAME
> +^ftp://*
>
>
> When I search for something in Tomcat 5.0.28, the results are garbled
> characters. Can anyone tell me anything helpful for solving this problem?
> Any reply will be appreciated.
>
> --
> www.babatu.com
>
>


--
www.babatu.com
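
For readers following the quoted crawl-urlfilter.txt, the sketch below shows
how such a rule list could be evaluated. It is not Nutch code. It assumes that
rules are tried top to bottom, that the first pattern found anywhere in the URL
decides the outcome, and that a URL matching no rule is skipped. The function
name accepts and the example.com URLs are hypothetical. The rule that the
archive shows only as "[EMAIL PROTECTED]" is left out because its original text
is not recoverable, and the commented-out rule is left out because it is
inactive.

```python
import re

# Active rules copied from the quoted crawl-urlfilter.txt.
# "-" means skip the URL, "+" means accept it.
RULES = [
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$"),
    ("+", r"^ftp://*"),
]


def accepts(url: str) -> bool:
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is skipped


print(accepts("ftp://example.com/docs/readme.txt"))
print(accepts("ftp://example.com/archive.zip"))
print(accepts("http://example.com/index.html"))
```

Note that in the pattern ^ftp://* the trailing * applies to the second slash,
so the rule matches any URL that starts with "ftp:/" followed by zero or more
further slashes. It is a regular expression, not a shell-style wildcard.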



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
