Hi,
I had a quick look at the sandbox, but my problem is that I don't actually need a Spanish stemmer. There must, however, be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one?
In answer to Peter's question: yes, I'm also using UTF-8 encoded XML documents as the source.
Below is an example of what happens when I tokenize the text using the StandardTokenizer.
Thanks,
Hannah
------------------text I'm trying to index
century palace known as la "Fundación Hospital de Na. Señora del Pilar"
-----------------tokens output by StandardTokenizer
century palace known as la � Fundaci� * n * Hospital de Na Se� * ora * del Pilar �
-----------------------
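One possible explanation for the split tokens above, offered as an assumption and not something confirmed in this thread, is that the UTF-8 bytes are being decoded as ISO-8859-1 before they reach the tokenizer. Each accented letter then turns into two characters, the second of which is not a letter, so the word breaks at that point. The sketch below reproduces the pattern in plain Java. The class and its helpers (`letterTokens`, `misdecode`) are hypothetical stand-ins for illustration; `letterTokens` is a simple letter-run splitter, not the JavaCC-generated StandardTokenizer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MisdecodedTokens {

    // Hypothetical stand-in for a tokenizer: collects runs of characters
    // for which Character.isLetter is true. This is NOT StandardTokenizer.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Simulates the suspected mistake: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of source-file encoding.
        String word = "Fundaci\u00f3n";

        // Correctly decoded text stays in one piece.
        System.out.println(letterTokens(word).size());            // prints 1

        // Mis-decoded text breaks where the accented letter was.
        System.out.println(letterTokens(misdecode(word)).size()); // prints 2
    }
}
```

If the tokens in your index split the same way, the encoding used to read the files is worth checking before replacing the tokenizer.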
From: "Peter M Cipollone" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: Re: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 11:41:28 -0400
Could you send some sample text that causes this to happen?
----- Original Message -----
From: "Hannah c" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 11:30 AM
Subject: Problem indexing Spanish Characters
>
> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such
> there are a number of Spanish characters throughout the text, most of these
> are in the place names, which are the type of words I would like to use as
> queries. My problem is with the StandardTokenizer class, which cuts the word
> into two when it comes across any of the Spanish characters. I had a look at
> the source, but the code was generated by JavaCC and so is not very readable.
> I was wondering if there was a way around this problem or which area of the
> code I would need to change to avoid this.
>
> Thanks
> Hannah Cumming
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
From: PEP AD Server Administrator <[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Subject: AW: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 18:08:56 +0200
Hi Hannah, Otis,
I cannot help, but I have exactly the same problem with special German characters. I used the Snowball analyzer, but this does not help because the problem (tokenizing) appears before the analyzer comes into action. A few minutes ago I posted the question "Problem tokenizing UTF-8 with geman umlauts", which describes my problem, and Hannah's seems to be similar. Do you also have UTF-8 encoded pages?
Peter MH
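
Since both reports involve UTF-8 encoded sources, one thing worth ruling out is that the files are read with the platform default encoding. The sketch below is only an illustration under that assumption: it shows the general Java pattern of naming the charset when turning bytes into characters. The class and method names (`Utf8ReaderSketch`, `readUtf8`) are hypothetical, and it uses `StandardCharsets` from a modern JDK. The same kind of explicitly decoded `Reader` is what would be passed on to the analyzer.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderSketch {

    // Hypothetical helper: reads a whole byte stream, decoding it explicitly
    // as UTF-8 instead of relying on the platform default charset.
    static String readUtf8(InputStream in) {
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 encoded document on disk.
        String original = "Se\u00f1ora del Pilar";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String decoded = readUtf8(new ByteArrayInputStream(utf8Bytes));

        // The accented character survives the round trip intact.
        System.out.println(decoded.equals(original)); // prints true
    }
}
```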
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 19 May 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters
It looks like the Snowball project supports Spanish: http://www.google.com/search?q=snowball spanish
If it does, take a look at the Lucene Sandbox. There is a project there that allows you to use Snowball analyzers with Lucene.
Otis
---------------------------------------------------------------------
Hannah Cumming
[EMAIL PROTECTED]
