[ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786701#action_12786701
 ] 

Robert Muir commented on LUCENE-1343:
-------------------------------------

The big picture here, and in all the other duplicated normalization issues
across JIRA, is the outdated Unicode support in the JDK.

This issue speaks of removing diacritical marks / NSMs, but the underlying
issue is missing Unicode normalization, which is duplicated in LUCENE-1215
(incorrectly named) and in LUCENE-1488 (disclaimer: my impl).

As for the accent removal: in truth, I do not think we should simply be
removing NSMs, because in most cases they are there for a reason. For example,
they are diacritics in a lot of European languages, but in many eastern
languages they are the actual vowels (e.g. in all the Indic scripts).
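
To make the point above concrete, here is a minimal sketch of what blanket
NSM removal does. It is not code from any of the attached patches: the class
and method names are hypothetical, and it assumes the JDK's
java.text.Normalizer (Java 6+) plus the regex category \p{Mn}.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not part of any Lucene patch.
final class MarkStrippingSketch {

    // Decompose to NFD, then drop every code point in general category Mn
    // (non-spacing mark), regardless of script or language.
    static String stripNonSpacingMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        // Latin: "caf" + U+00E9 loses only the accent and becomes "cafe".
        System.out.println(stripNonSpacingMarks("caf\u00E9"));

        // Devanagari: KA (U+0915) + VOWEL SIGN U (U+0941, category Mn).
        // The vowel itself is removed, leaving the bare consonant.
        System.out.println(stripNonSpacingMarks("\u0915\u0941"));
    }
}
```

The same property-based rule that looks harmless on the Latin input changes
the Devanagari syllable into a different one.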

We need to separate the issue of missing Unicode normalization (which is
clearly something Lucene needs) from the issue of removing diacritics (which
is language-specific, so doing it based on Unicode properties is
inappropriate).
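
As a rough illustration of the normalization half of that split, the sketch
below compares the composed and decomposed spellings of the same character.
It assumes the JDK's java.text.Normalizer rather than the ICU-based filter
from LUCENE-1488, and the class and method names are made up for the example.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not part of any Lucene patch.
final class CanonicalEquivalenceSketch {

    // Two strings are canonically equivalent if their NFC forms are equal.
    static boolean equalAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e with acute as a single code point
        String decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

        // Without normalization the two spellings are different terms.
        System.out.println(composed.equals(decomposed));        // false
        // After normalization they match, and no accent was removed.
        System.out.println(equalAfterNfc(composed, decomposed)); // true
    }
}
```

Normalization alone makes the two spellings match in the index; whether the
accent should also be dropped is the separate, language-specific decision.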

Finally, just normalizing Unicode in Lucene by itself is not very useful,
because normalization interacts closely with other processes, and attention
needs to be paid to the order in which filters are run. For example, its
interaction with case folding can be a bit tricky. If you are interested in
this issue, I urge you to read the javadocs writeup I placed in the
ICUNormalizationFilter in LUCENE-1488.
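
One small way to see why the order matters is sketched below. It is not taken
from the LUCENE-1488 javadocs; it assumes the JDK's java.text.Normalizer with
NFKC and plain String.toLowerCase as a stand-in for real case folding, and
the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration only; not part of any Lucene patch.
final class FilterOrderSketch {

    // Normalize first, then lowercase the result.
    static String normalizeThenLower(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC)
                .toLowerCase(Locale.ROOT);
    }

    // Lowercase first, then normalize the result.
    static String lowerThenNormalize(String s) {
        return Normalizer.normalize(s.toLowerCase(Locale.ROOT),
                Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+2122 TRADE MARK SIGN has no lowercase mapping, but its
        // compatibility decomposition is the two capital letters "TM".
        String tm = "\u2122";

        System.out.println(normalizeThenLower(tm)); // tm
        System.out.println(lowerThenNormalize(tm)); // TM
    }
}
```

Lowercasing before normalizing leaves capital letters in the output, because
they only appear once the symbol has been decomposed.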


> A replacement for ISOLatin1AccentFilter that does a more thorough job of 
> removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, 
> UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java
>
>
> The ISOLatin1AccentFilter takes Unicode characters that have diacritical 
> marks and replaces them with a version of that character with the diacritical 
> mark removed.  For example é becomes e.  However another equally valid way of 
> representing an accented character in Unicode is to have the unaccented 
> character followed by a non-spacing modifier character (like this:  é  )    
> The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode
> characters at all. Additionally, there are some instances where a word will
> contain what looks like an accented character but is actually considered to
> be a separate unaccented character, such as Ł, which you may nevertheless
> want to fold onto its Latin-1 lookalike, L, to make searching easier.
> The UnicodeNormalizationFilter can filter out accents and diacritical marks
> whether they occur as composed or decomposed characters. It can also handle
> the cases described above, where characters that look like they have
> diacritics (but don't) are folded onto the letter they resemble (Ł -> L).
