Actually, I'm just looking to remove accentuated chars from java chars (so Unicode), only for the search (original doc should stay the same as I display), I'll just implement a TokenFilter to do this. It should be relatively simple. Just wanted to know if it had already been done (perhaps in a generic way).

Stephane

Joshua O'Madadhain wrote:

On Tue, 10 Dec 2002, stephane vaucher wrote:

I wish to implement a TokenFilter that will remove accentuated
characters so for example '�' will become 'e'. As I would rather not
reinvent the wheel, I've tried to find something on the web and on the
mailing lists. I saw a mention of a contrib that could do this (see
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html),
but I don't see anything applicable.

It may depend on what kind of encoding you're working with.  (E.g., HTML
documents represent such characters in a different way than that of
Postscript documents.)  Probably the easiest way to handle this, if you
want to avoid such questions, would be to convert all your input documents
(and query text) to Java (Unicode) strings, and then do a
search-and-replace with the appropriate character-pair arguments.  (After
this is done, you would then do whatever Lucene processing (indexing,
query parsing, etc.) was appropriate.  I am not aware of any code that
does this, but it should be straightforward.

Good luck--

Joshua O'Madadhain

[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
 Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Reply via email to