RE: AW: Problem indexing Spanish Characters

Hannah c Wed, 19 May 2004 09:35:32 -0700

Hi,

I had a quick look at the sandbox, but I don't actually need a Spanish stemmer. However, there must be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one?

In answer to Peter's question: yes, I'm also using UTF-8 encoded XML documents as the source. Below is an example of what happens when I tokenize the text with the StandardTokenizer.
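
As a rough illustration of what a tokenizer that keeps accented letters inside a word would do, here is a minimal plain-Java sketch. It is not Lucene code and not a drop-in replacement for StandardTokenizer: the class name UnicodeWordTokenizer and its tokenize method are hypothetical, and the sketch simply assumes that any run of Unicode letters, combining marks, and decimal digits counts as one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordTokenizer {
    // A word is a run of Unicode letters (\p{L}), combining marks (\p{M}),
    // or decimal digits (\p{Nd}). Everything else acts as a separator.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{M}\\p{Nd}]+");

    // Hypothetical helper, named for this example only.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00f3 is o-acute and \u00f1 is n-tilde; escapes avoid any
        // dependence on the encoding of this source file.
        String text = "la Fundaci\u00f3n Hospital de Na. Se\u00f1ora del Pilar";
        // Accented words stay whole because the accented letters match \p{L}.
        System.out.println(tokenize(text));
    }
}
```

This only works if the text has already been decoded into correct Java characters; it does nothing for input whose bytes were decoded with the wrong charset.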

Thanks Hannah

------------------text I'm trying to index

century palace known as la “Fundación Hospital de Na. Señora del Pilar”

-----------------tokens output by StandardTokenizer

century
palace
known
as
la
�
Fundaci�    *
n               *
Hospital
de
Na
Se�          *
ora           *
del
Pilar
�
-----------------------
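
The split pattern above (a word cut right after the accented letter) is what one would see if UTF-8 bytes were decoded as ISO-8859-1 before tokenizing. That cause is an assumption, not something confirmed in this thread. The plain-Java sketch below reproduces the effect with a deliberately simple letter-based splitter; DecodingDemo and letterTokens are hypothetical names and the splitter is not Lucene's StandardTokenizer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DecodingDemo {
    // Hypothetical helper: emit maximal runs of letters, split on anything else.
    public static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00f3 is o-acute, \u00f1 is n-tilde.
        String original = "Fundaci\u00f3n Se\u00f1ora";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Wrong charset: each two-byte UTF-8 sequence becomes two characters,
        // a letter (U+00C3) followed by a non-letter symbol, so the word splits.
        String misdecoded = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(letterTokens(misdecoded).size()); // 4 tokens

        // Right charset: the accented letters survive and the words stay whole.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(letterTokens(decoded).size()); // 2 tokens
    }
}
```

If this is the cause, the fix belongs where the document bytes are turned into characters, not in the tokenizer grammar.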

From: "Peter M Cipollone" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: Re: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 11:41:28 -0400
Could you send some sample text that causes this to happen?
----- Original Message -----
From: "Hannah c" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 11:30 AM
Subject: Problem indexing Spanish Characters
> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such
> there are a number of Spanish characters throughout the text, most of these
> are in the place names which are the type of words I would like to use as
> queries. My problem is with the StandardTokenizer class which cuts the word
> into two when it comes across any of the Spanish characters. I had a look at
> the source but the code was generated by JavaCC and so is not very readable.
> I was wondering if there was a way around this problem or which area of the
> code I would need to change to avoid this.
>
> Thanks
> Hannah Cumming
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

From: PEP AD Server Administrator <[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Subject: AW: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 18:08:56 +0200

Hi Hannah, Otis,
I cannot help, but I have exactly the same problem with special German
characters. I used the Snowball analyzer, but this does not help because the
problem (tokenizing) appears before the analyzer comes into action.
I just posted the question "Problem tokenizing UTF-8 with geman umlauts"
a few minutes ago, which describes my problem; Hannah's seems to be similar.
Do you also have UTF-8 encoded pages?
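
Since Lucene tokenizers consume a java.io.Reader, one thing worth checking for UTF-8 sources is how that Reader is built. The sketch below shows, under the assumption that the input really is UTF-8, how to decode a byte stream with an explicit charset instead of relying on the platform default. Utf8ReaderDemo, openUtf8, and readUtf8 are hypothetical names for this example and no Lucene classes are used.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderDemo {
    // Hypothetical helper: wrap raw bytes in a Reader that decodes them as UTF-8.
    // A Reader built this way could then be handed to whatever consumes text.
    public static Reader openUtf8(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Hypothetical helper: read the whole stream into a String via openUtf8.
    public static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try (Reader reader = openUtf8(in)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulate a UTF-8 encoded source; \u00f1 is n-tilde.
        byte[] source = "Se\u00f1ora del Pilar".getBytes(StandardCharsets.UTF_8);
        String text = readUtf8(new ByteArrayInputStream(source));
        // The decoded text has the same number of characters as the original,
        // which would not hold if the bytes were decoded as ISO-8859-1.
        System.out.println(text.length());
    }
}
```

For a file on disk the same idea applies: open an InputStream on the file and wrap it with an InputStreamReader that names the charset explicitly.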

Peter MH

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 19 May 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters


It looks like the Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish

If it does, take a look at the Lucene Sandbox.  There is a project that
allows you to use Snowball analyzers with Lucene.

Otis


--- Hannah c <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I  am indexing a number of English articles on Spanish resorts. As
> such
> there are a number of Spanish characters throughout the text, most of
> these
> are in the place names which are the type of words I would like to
> use as
> queries. My problem is with the StandardTokenizer class which cuts
> the word
> into two when it comes across any of the Spanish characters. I had a
> look at
> the source but the code was generated by JavaCC and so is not very
> readable.
> I was wondering if there was a way around this problem or which area
> of the
> code I would need to change to avoid this.
>
> Thanks
> Hannah Cumming

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
Hannah Cumming [EMAIL PROTECTED]

