Thanks for the advice!
Now I know what's up.
But my OS is Windows XP (Chinese edition), which supports Chinese very well.
I used Luke to look at the index, and there are garbled characters when
crawling Chinese web sites.
  So, how can I deal with it?

any reply will be appreciated.
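
For reference, here is a minimal Python sketch of how this kind of garbling
usually arises. It assumes the FTP server or web site sends its text as GBK
bytes and the reader decodes them with a different charset. The file name and
the helper try_decode are hypothetical, made up for illustration; nothing here
is Nutch code or a Nutch setting.

```python
# Sketch only: shows how GBK bytes turn into garbled text when decoded
# with the wrong charset, and one way to guess the right one.

# Hypothetical Chinese file name, as a server might list it.
text = "中文文件.txt"

# Assumption: the server sends the name encoded as GBK.
raw = text.encode("gbk")

# Decoding with ISO-8859-1 never raises, but produces garbled characters.
wrong = raw.decode("iso-8859-1")

# ISO-8859-1 maps every byte to one character, so the damage is reversible:
# re-encode to get the original bytes back, then decode them as GBK.
fixed = wrong.encode("iso-8859-1").decode("gbk")


def try_decode(data, encodings=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: tries each encoding in order and falls back to
    ISO-8859-1, which accepts any byte sequence.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return data.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    print(repr(wrong))      # garbled form
    print(fixed)            # recovered name
    print(try_decode(raw))  # decoded text plus the encoding that worked
```

UTF-8 is tried first because it is strict: most GBK byte sequences are not
valid UTF-8, so a clean UTF-8 decode is a strong hint that the data really is
UTF-8. This is a heuristic, not a guarantee.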

On 4/2/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
>
> Good Morning Kauu,
>
> I have noticed that Nutch only knows about UTF-8 character codes, so
> simplified Chinese served as UTF-8 should come out OK. If the crawl sees
> Chinese in a non-UTF-8 encoding, the web site may be serving it under an
> older ISO standard, or you may not have the language pack installed to
> properly support Chinese.
>
> Personally, I would download the language pack for your operating system
> and see what happens.
>
> r/d
>
> -----Original Message-----
> From: kauu [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 02, 2006 7:48 AM
> To: [email protected]
> Subject: hi all
>
> Hi all:
>    I have a big problem when crawling FTP.
>   It seems that Nutch can't parse or index files with Chinese names!
> The command I ran looks like:
>
> bin/nutch crawl urls.txt -dir test.dir
>
> (I've modified crawl-urlfilter.txt as follows)
>
>
> # skip file:, ftp:, & mailto: urls
> #-^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # accept hosts in MY.DOMAIN.NAME
> +^ftp://*
>
>
> When I search for something in Tomcat 5.0.28, the results are garbled
> characters. Can anyone tell me anything helpful to solve this problem?
> Any reply will be appreciated.
>
> --
> www.babatu.com
>
>


--
www.babatu.com
