Hi,

I had a quick look at the sandbox, but I don't actually need a Spanish stemmer. However, there must be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one?

In answer to Peter's question: yes, I'm also using UTF-8 encoded XML documents as the source.
Below is an example of what happens when I tokenize the text with the StandardTokenizer.


Thanks,
Hannah



------------------text I'm trying to index

century palace known as la �Fundación Hospital de Na. Señora del Pilar�

-----------------tokens output by StandardTokenizer

century
palace
known
as
la
�
Fundaci�    *
n               *
Hospital
de
Na
Se�          *
ora           *
del
Pilar
�
-----------------------
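
One possible cause of the split pattern above, offered only as a hypothesis: the UTF-8 bytes are being decoded with a single-byte charset such as ISO-8859-1 before they reach the tokenizer. The sketch below uses only the Java standard library (no Lucene classes). The class name DecodeDemo and the helper decode are made up for illustration, and Character.isLetter is only a rough stand-in for StandardTokenizer's own letter rules.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Encode the text as UTF-8, then decode those bytes with another charset.
    // This imitates a reader that was opened with the wrong encoding.
    static String decode(String text, Charset charsetUsedByReader) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charsetUsedByReader);
    }

    public static void main(String[] args) {
        // "Fundacion" with o-acute, written as an escape so the source file
        // encoding does not matter.
        String word = "Fundaci\u00f3n";
        String misread = decode(word, StandardCharsets.ISO_8859_1);

        // The accented letter is two bytes in UTF-8, so the misread string
        // is one character longer than the original.
        System.out.println(word.length() + " chars become " + misread.length());

        // Show which of the resulting characters count as letters. A
        // non-letter in the middle of the word is a natural split point.
        for (char c : misread.toCharArray()) {
            System.out.printf("U+%04X letter=%b%n", (int) c, Character.isLetter(c));
        }
    }
}
```

If the code points printed for your own input include a non-letter where the accented character used to be, the encoding used to open the file is worth checking before the tokenizer itself.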



From: "Peter M Cipollone" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: Re: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 11:41:28 -0400

Could you send some sample text that causes this to happen?

----- Original Message -----
From: "Hannah c" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 11:30 AM
Subject: Problem indexing Spanish Characters


> Hi,
>
> I am indexing a number of English articles on Spanish resorts. As such
> there are a number of Spanish characters throughout the text, most of
> these are in the place names, which are the type of words I would like to
> use as queries. My problem is with the StandardTokenizer class, which cuts
> the word into two when it comes across any of the Spanish characters. I
> had a look at the source, but the code was generated by JavaCC and so is
> not very readable. I was wondering if there was a way around this problem,
> or which area of the code I would need to change to avoid this.
>
> Thanks
> Hannah Cumming
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]






From: PEP AD Server Administrator <[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Subject: AW: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 18:08:56 +0200


Hi Hannah, Otis,
I cannot help, but I have exactly the same problem with special German
characters. I used the Snowball analyzer, but this does not help because the
problem (tokenizing) appears before the analyzer comes into action.
I just posted the question "Problem tokenizing UTF-8 with geman umlauts"
a few minutes ago, which describes my problem, and Hannah's seems to be similar.
Do you also have UTF-8 encoded pages?

Peter MH
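
Since the splitting happens at the tokenizing stage, before any analyzer or stemmer runs, here is a minimal sketch of the two things that stage needs: a Reader opened with an explicit UTF-8 charset, and a rule that treats accented letters as letters. It is plain Java with no Lucene classes. The class name LetterRuns and the method tokenize are hypothetical, and this is not how StandardTokenizer is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LetterRuns {
    // Collect maximal runs of letters from the reader. Character.isLetter
    // accepts accented letters and umlauts, so they stay inside the word.
    // Characters outside the Basic Multilingual Plane are not handled here.
    static List<String> tokenize(Reader reader) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isLetter((char) c)) {
                current.append((char) c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a UTF-8 encoded file on disk.
        byte[] utf8 = "Fundaci\u00f3n Hospital de Na. Se\u00f1ora del Pilar"
                .getBytes(StandardCharsets.UTF_8);
        // Name the charset explicitly instead of relying on the platform default.
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(tokenize(reader));
    }
}
```

The same explicitly decoded Reader could be handed to whichever tokenizer is in use. Whether that alone resolves the problem described in this thread is an assumption, not something the thread confirms.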

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 19 May 2004 17:42
To: Lucene Users List
Subject: Re: Problem indexing Spanish Characters


It looks like the Snowball project supports Spanish: http://www.google.com/search?q=snowball spanish

If it does, take a look at Lucene Sandbox.  There is a project that
allows you to use Snowball analyzers with Lucene.

Otis





Hannah Cumming
[EMAIL PROTECTED]






