[jira] Created: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Robert Haschart (JIRA) Tue, 22 Jul 2008 12:29:53 -0700

A replacement for ISOLatin1AccentFilter that does a more thorough job of 
removing diacritical marks or non-spacing modifiers.
-----------------------------------------------------------------------------------------------------------------------------


                 Key: LUCENE-1343
                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Robert Haschart
            Priority: Minor


The ISOLatin1AccentFilter takes Unicode characters that carry diacritical 
marks and replaces each one with the same character minus the mark.  For 
example, é becomes e.  However, an equally valid way of representing an 
accented character in Unicode is the unaccented character followed by a 
non-spacing modifier character (for example, e followed by U+0301 COMBINING 
ACUTE ACCENT, which also renders as é).  The ISOLatin1AccentFilter doesn't 
handle the accents in decomposed Unicode characters at all.  Additionally, 
some words contain what looks like an accented character but is actually 
considered a separate, unaccented character, such as Ł.  To make searching 
easier, you may want to fold such a character onto its Latin-1 lookalike, L.
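
To make the distinction above concrete, here is a minimal sketch using the 
JDK's java.text.Normalizer (Java 6 and later).  The class name AccentForms 
and its helper are made up for illustration and are not part of Lucene or of 
the proposed filter.  It shows that the composed and decomposed forms of é 
are different strings until normalized, and that Ł has no canonical 
decomposition, so normalization alone cannot fold it onto L.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene.
public class AccentForms {

    // Returns the canonical decomposition (NFD) of the input.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // single code point U+00E9
        String decomposed = "e\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT

        // The two strings render the same but are not equal as char sequences.
        System.out.println(composed.equals(decomposed));            // false

        // After NFD, the composed form matches the decomposed one.
        System.out.println(decompose(composed).equals(decomposed)); // true

        // U+0141 (L with stroke) has no decomposition, so NFD leaves it alone.
        String lStroke = "\u0141";
        System.out.println(decompose(lStroke).equals(lStroke));     // true
    }
}
```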

The UnicodeNormalizationFilter can filter out accents and diacritical marks 
whether they occur as composed characters or decomposed characters.  It can 
also handle the cases described above, where characters that look like they 
have diacritics (but don't) are to be folded onto the letter that they look 
like ( Ł -> L ).
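
The behaviour described above could be approximated along the following 
lines.  This is only a sketch under stated assumptions, not the code of the 
attached UnicodeNormalizationFilter: the class name DiacriticFolder, the 
fold method, and the two-entry lookalike table are all hypothetical, and a 
real filter would operate on Lucene tokens rather than plain strings.  The 
idea is to decompose with NFD, drop non-spacing marks, and then apply an 
explicit mapping for lookalike letters that do not decompose.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the filter proposed in this issue.
public class DiacriticFolder {

    // Illustrative table for letters that have no canonical decomposition.
    // A real table would need many more entries.
    private static final Map<Character, Character> LOOKALIKES =
            new HashMap<Character, Character>();
    static {
        LOOKALIKES.put('\u0141', 'L'); // capital L with stroke
        LOOKALIKES.put('\u0142', 'l'); // small l with stroke
    }

    static String fold(String input) {
        // Step 1: decompose, so composed and decomposed input look the same.
        String nfd = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(nfd.length());
        for (int i = 0; i < nfd.length(); i++) {
            char c = nfd.charAt(i);
            // Step 2: skip non-spacing marks (Unicode category Mn).
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue;
            }
            // Step 3: fold known lookalikes; other chars pass through as-is.
            Character mapped = LOOKALIKES.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00E9"));              // composed input
        System.out.println(fold("e\u0301"));             // decomposed input
        System.out.println(fold("\u0141\u00F3d\u017A")); // Polish city name
    }
}
```

Under these assumptions both forms of é fold to e, and the Polish name in 
the last call folds to Lodz.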

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
