The tokenizers deal with Unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by wrapping your FileInputStream (or whatever InputStream you're using) in an InputStreamReader and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. A FileReader cannot be wrapped this way, since it is already a Reader and has decoded the bytes with the platform default encoding.
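A minimal sketch of the point above, not taken from anyone's actual indexing code: the class name Utf8ReadSketch and the method readAll are made up for illustration, a ByteArrayInputStream stands in for the FileInputStream you would open on a real source file, and StandardCharsets assumes Java 7 or later (on older JDKs, pass the string "UTF-8" to the InputStreamReader constructor instead). It decodes the same UTF-8 bytes once with the right charset and once with ISO-8859-1, to show how the wrong choice turns each accented letter into two characters.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
class Utf8ReadSketch {

    // Decode every byte of the stream into a String using an explicit charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a UTF-8 encoded source file.
        byte[] utf8Bytes = "Fundaci\u00f3n Se\u00f1ora".getBytes(StandardCharsets.UTF_8);

        // Correct: the declared charset matches the bytes.
        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        // Incorrect: each two-byte UTF-8 sequence becomes two separate chars.
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.length());                      // 16
        System.out.println(wrong.length());                      // 18
        // The second char produced from the mis-decoded accent is not a letter.
        System.out.println(Character.isLetter(wrong.charAt(8))); // false
    }
}
```

With the wrong charset, the second character of each mis-decoded pair is a symbol rather than a letter, so a tokenizer that breaks on non-letters would plausibly cut the word at exactly that point. That would be consistent with the kind of split reported in this thread, though I have not checked it against StandardTokenizer itself.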
You must either detect the encoding from the HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode it (UTF-8, for example).

Martin

-----Original Message-----
From: Hannah c [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 10:35 AM
To: [EMAIL PROTECTED]
Subject: RE: AW: Problem indexing Spanish Characters

Hi,

I had a quick look at the sandbox, but my problem is that I don't need a Spanish stemmer. However, there must be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one?

In answer to Peter's question: yes, I'm also using "UTF-8" encoded XML documents as the source. Below is an example of what happens when I tokenize the text using the StandardTokenizer.

Thanks
Hannah

------------------text I'm trying to index
century palace known as la �Fundación Hospital de Na. Señora del Pilar�

-----------------tokens output from StandardTokenizer
century palace known as la � Fundaci� * n * Hospital de Na Se� * ora * del Pilar �
-----------------------

>From: "Peter M Cipollone" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Subject: Re: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 11:41:28 -0400
>
>could you send some sample text that causes this to happen?
>
>----- Original Message -----
>From: "Hannah c" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Sent: Wednesday, May 19, 2004 11:30 AM
>Subject: Problem indexing Spanish Characters
>
> > Hi,
> >
> > I am indexing a number of English articles on Spanish resorts. As such there are a number of Spanish characters throughout the text, most of them in the place names, which are the type of words I would like to use as queries. My problem is with the StandardTokenizer class, which cuts the word into two when it comes across any of the Spanish characters.
> > I had a look at the source, but the code was generated by JavaCC and so is not very readable.
> > I was wondering if there was a way around this problem, or which area of the code I would need to change to avoid this.
> >
> > Thanks
> > Hannah Cumming

>From: PEP AD Server Administrator <[EMAIL PROTECTED]>
>Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
>To: "'Lucene Users List'" <[EMAIL PROTECTED]>
>Subject: AW: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 18:08:56 +0200
>
>Hi Hannah, Otis
>I cannot help, but I have exactly the same problems with special German characters. I used the Snowball analyser, but this does not help because the problem (tokenizing) appears before the analyser comes into action.
>I just posted the question "Problem tokenizing UTF-8 with geman umlauts" some minutes ago, which describes my problem, and Hannah's seems to be similar.
>Do you also have UTF-8 encoded pages?
>
>Peter MH
>
>-----Original Message-----
>From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, 19 May 2004 17:42
>To: Lucene Users List
>Subject: Re: Problem indexing Spanish Characters
>
>It looks like Snowball project supports Spanish:
>http://www.google.com/search?q=snowball spanish
>
>If it does, take a look at Lucene Sandbox. There is a project that allows you to use Snowball analyzers with Lucene.
>
>Otis

Hannah Cumming
[EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
