[
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-1343:
--------------------------------
Summary: A replacement for AsciiFoldingFilter that does a more
thorough job of removing diacritical marks or non-spacing modifiers. (was: A
replacement for ISOLatin1AccentFilter that does a more thorough job of removing
diacritical marks or non-spacing modifiers.)
Fix Version/s: 3.1
Affects Version/s: 3.1
Lucene Fields: [New, Patch Available] (was: [New])
Description:
The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks
and replaces them with a version of that character with the diacritical mark
removed. For example, é becomes e. However, another equally valid way of
representing an accented character in Unicode is to have the unaccented
character followed by a non-spacing modifier character (for example, e followed
by U+0301 COMBINING ACUTE ACCENT, which also renders as é). The
ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode
characters at all. Additionally, some words contain what looks like an accented
character but is actually considered a separate, unaccented character, such as
Ł. To make searching easier, you may want to fold such a character onto its
Latin-1 lookalike, L.
The UnicodeNormalizationFilter can filter out accents and diacritical marks
whether they occur as composed or decomposed characters. It can also handle the
cases described above, where characters that look like they have diacritics
(but don't) are folded onto the letter they resemble (Ł -> L).
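To make the composed/decomposed distinction concrete, here is a minimal sketch
using only the JDK's java.text.Normalizer. It is not the attached
UnicodeNormalizationFilter and does not use utr30.nrm; the class name
FoldingSketch, the method fold, and the two-entry lookalike table are
hypothetical and exist only for illustration. The idea is to decompose to NFD,
drop non-spacing marks, and then map letters such as Ł that have no canonical
decomposition.

```java
import java.text.Normalizer;

public class FoldingSketch {

    // Hypothetical, hand-written table for letters that look accented but
    // have no canonical decomposition. A real filter would need a far
    // larger mapping than these two entries.
    static int foldLookalike(int codePoint) {
        switch (codePoint) {
            case 0x0141: return 'L'; // Ł LATIN CAPITAL LETTER L WITH STROKE
            case 0x0142: return 'l'; // ł LATIN SMALL LETTER L WITH STROKE
            default:     return codePoint;
        }
    }

    static String fold(String input) {
        // NFD turns composed characters (é, U+00E9) into a base letter
        // followed by combining marks (e + U+0301), so composed and
        // decomposed input are handled by the same code path.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue; // drop the diacritic
            }
            out.appendCodePoint(foldLookalike(cp));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00e9"));                   // composed é
        System.out.println(fold("e\u0301"));                  // decomposed é
        System.out.println(fold("\u0141\u00f3d\u017a"));      // Łódź
    }
}
```

Both spellings of é fold to the same output because the input is decomposed
before the marks are removed. The lookalike step is needed separately because
NFD leaves Ł unchanged.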
> A replacement for AsciiFoldingFilter that does a more thorough job of
> removing diacritical marks or non-spacing modifiers.
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-1343
> URL: https://issues.apache.org/jira/browse/LUCENE-1343
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.1
> Reporter: Robert Haschart
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-1343.patch, normalizer.jar, UnicodeCharUtil.java,
> UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java,
> utr30.nrm