RE: AW: Problem indexing Spanish Characters

Martin Remy Wed, 19 May 2004 11:09:44 -0700

The tokenizers deal with Unicode characters (CharStream, char), so the
problem is not there.  It must be solved at the point where the bytes from
your source files are turned into CharSequences/Strings, i.e. by wrapping
your FileInputStream (or whatever byte stream you're using) in an
InputStreamReader and specifying "UTF-8" (or whatever encoding is
appropriate) in the InputStreamReader constructor.  A plain FileReader
decodes with the platform default encoding.
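
A minimal sketch of that idea, assuming the source is a local file whose
encoding is already known. The class and method names (Utf8ReaderSketch,
open, readAll) are made up for illustration; only the java.io classes are
real.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper, not part of Lucene.
class Utf8ReaderSketch {

    /** Opens a file as characters, decoding its bytes with the named encoding. */
    static Reader open(String path, String encoding) throws IOException {
        // FileInputStream yields raw bytes; InputStreamReader turns them into
        // chars using the encoding we name instead of the platform default.
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), encoding));
    }

    /** Reads the whole reader into a String and closes it. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        try {
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a path to a UTF-8 encoded file (hypothetical input).
        System.out.println(readAll(open(args[0], "UTF-8")));
    }
}
```

The Reader returned by open() is what you would hand to the code that
builds your documents, so the tokenizer only ever sees correctly decoded
characters.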


You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, just
hardcode it (UTF-8, for example).
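
For the XML case, a rough sketch of reading the encoding name out of the
XML declaration might look like the following. It is a simplification
under stated assumptions: it only handles ASCII-compatible encodings,
ignores byte order marks and UTF-16, and the names XmlEncodingSniffer and
sniff are hypothetical. If you feed the raw byte stream to an XML parser,
the parser does this detection itself.

```java
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a simplified look at the XML declaration only.
class XmlEncodingSniffer {

    // Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?> at the very start.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the encoding named in the XML declaration found in the first
     * bytes of a document, or the fallback if no declaration names one.
     */
    static String sniff(byte[] head, String fallback) {
        try {
            // ISO-8859-1 maps every byte to one char, so the ASCII text of
            // the declaration survives whatever the real encoding is
            // (as long as that encoding is ASCII-compatible).
            int len = Math.min(head.length, 200);
            String prolog = new String(head, 0, len, "ISO-8859-1");
            Matcher m = ENCODING.matcher(prolog);
            return m.find() ? m.group(1) : fallback;
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The returned name can then be passed as the second argument of the
InputStreamReader constructor.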

Martin

-----Original Message-----
From: Hannah c [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 19, 2004 10:35 AM
To: [EMAIL PROTECTED]
Subject: RE: AW: Problem indexing Spanish Characters

Hi,

I had a quick look at the sandbox, but my problem is that I don't need a
Spanish stemmer. However, there must be a replacement tokenizer that supports
foreign characters to go along with the foreign-language Snowball stemmers.
Does anyone know where I could find one?

In answer to Peter's question: yes, I'm also using "UTF-8" encoded XML
documents as the source.
Below is an example of what happens when I tokenize the text using the
StandardTokenizer.

Thanks Hannah



------------------text I'm trying to index

century palace known as la �Fundación Hospital de Na. Señora del Pilar�

-----------------tokens output by StandardTokenizer

century
palace
known
as
la
�
Fundaci�    *
n               *
Hospital
de
Na
Se�          *
ora           *
del
Pilar
�
-----------------------
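
One plausible explanation for the token list above is that the UTF-8
bytes were decoded with a single-byte encoding such as ISO-8859-1 before
they reached the tokenizer. The sketch below reproduces that effect with
a deliberately simplified tokenizer that just splits on non-letters; it
is not StandardTokenizer, and the names MisdecodeDemo, misdecode and
letterRuns are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the tokenizer here is a stand-in, not Lucene's.
class MisdecodeDemo {

    /** Encodes text as UTF-8, then wrongly decodes the bytes as ISO-8859-1. */
    static String misdecode(String text) throws UnsupportedEncodingException {
        return new String(text.getBytes("UTF-8"), "ISO-8859-1");
    }

    /** Splits text into maximal runs of letters, dropping everything else. */
    static List<String> letterRuns(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String word = "Fundaci\u00f3n"; // "Fundacion" with an accented o
        // Correctly decoded: the accented o is a letter, so one token.
        System.out.println(letterRuns(word).size());
        // Mis-decoded: the two UTF-8 bytes of the accented o become a letter
        // (U+00C3) followed by a non-letter (U+00B3), so the word splits.
        System.out.println(letterRuns(misdecode(word)).size());
    }
}
```

With correct decoding the word stays whole; with the wrong decoding it
breaks at the accented character, which matches the starred lines in the
list.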



>From: "Peter M Cipollone" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Subject: Re: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 11:41:28 -0400
>
>could you send some sample text that causes this to happen?
>
>----- Original Message -----
>From: "Hannah c" <[EMAIL PROTECTED]>
>To: <[EMAIL PROTECTED]>
>Sent: Wednesday, May 19, 2004 11:30 AM
>Subject: Problem indexing Spanish Characters
>
>
> >
> > Hi,
> >
> > I am indexing a number of English articles on Spanish resorts. As
> > such there are a number of Spanish characters throughout the text,
> > most of these are in the place names, which are the type of words I
> > would like to use as queries. My problem is with the
> > StandardTokenizer class, which cuts the word into two when it comes
> > across any of the Spanish characters. I had a look at the source but
> > the code was generated by JavaCC and so is not very readable.
> > I was wondering if there was a way around this problem or which area
> > of the code I would need to change to avoid this.
> >
> > Thanks
> > Hannah Cumming
> >
> >
> >
> > --------------------------------------------------------------------
> > - To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
>




>From: PEP AD Server Administrator
><[EMAIL PROTECTED]>
>Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
>To: "'Lucene Users List'" <[EMAIL PROTECTED]>
>Subject: AW: Problem indexing Spanish Characters
>Date: Wed, 19 May 2004 18:08:56 +0200
>
>Hi Hannah, Otis
>I cannot help, but I have exactly the same problem with special German
>characters. I used the Snowball analyzer, but this does not help because the
>problem (tokenizing) appears before the analyzer comes into action.
>I just posted the question "Problem tokenizing UTF-8 with geman umlauts"
>some minutes ago, which describes my problem, and Hannah's seems to be similar.
>Do you also have UTF-8 encoded pages?
>
>Peter MH
>
>-----Original Message-----
>From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, May 19, 2004 17:42
>To: Lucene Users List
>Subject: Re: Problem indexing Spanish Characters
>
>
>It looks like the Snowball project supports Spanish:
>http://www.google.com/search?q=snowball spanish
>
>If it does, take a look at Lucene Sandbox.  There is a project that 
>allows you to use Snowball analyzers with Lucene.
>
>Otis
>
>
>--- Hannah c <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I am indexing a number of English articles on Spanish resorts. As
> > such there are a number of Spanish characters throughout the text,
> > most of these are in the place names, which are the type of words I
> > would like to use as queries. My problem is with the
> > StandardTokenizer class, which cuts the word into two when it comes
> > across any of the Spanish characters. I had a look at the source but
> > the code was generated by JavaCC and so is not very readable.
> > I was wondering if there was a way around this problem or which area
> > of the code I would need to change to avoid this.
> >
> > Thanks
> > Hannah Cumming
>


----------------------------------------------------------------------------
Hannah Cumming
[EMAIL PROTECTED]


