[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Ken Krugler (JIRA) Wed, 13 Aug 2008 20:50:48 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622432#action_12622432
 ]


Ken Krugler commented on LUCENE-1343:
-------------------------------------

Hi Robert,

FWIW, the issues being discussed here are very similar to those covered by the 
[Unicode Security Considerations|http://www.unicode.org/reports/tr36/] 
technical report #36, and associated data found in the [Unicode Security 
Mechanisms|http://www.unicode.org/reports/tr39/] technical report #39.

The fundamental issue for int'l domain name spoofing is detecting when two 
sequences of Unicode code points will render as similar glyphs...which is 
basically the same issue you're trying to address here, so that when you search 
for something you'll find all terms that "look" similar.

So for a more complete (though undoubtedly slower & bigger) solution, I'd 
suggest using ICU4J to do an NFKD normalization, then toss any combining 
marks (non-spacing or spacing), lower-case the result, and finally apply 
mappings using the confusables data tables found in the technical report #39 
referenced above.
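
A rough sketch of that pipeline, to make the four steps concrete. It uses the 
JDK's java.text.Normalizer as a stand-in for ICU4J so that it runs without 
extra jars. The class name FoldingSketch, the fold() and mapLookalike() 
helpers, and the three lookalike entries are all made up for illustration; 
the entries are NOT taken from the TR39 data files, which a real 
implementation would load instead.

```java
import java.text.Normalizer;
import java.util.Locale;

public class FoldingSketch {

    // Hypothetical stand-in for a lookup into the TR39 data tables.
    // These three entries are illustrative only, not copied from TR39.
    static String mapLookalike(char c) {
        switch (c) {
            case '\u0142': return "l"; // U+0142 LATIN SMALL LETTER L WITH STROKE
            case '\u00F8': return "o"; // U+00F8 LATIN SMALL LETTER O WITH STROKE
            case '\u0111': return "d"; // U+0111 LATIN SMALL LETTER D WITH STROKE
            default:       return String.valueOf(c);
        }
    }

    static String fold(String input) {
        // 1. Compatibility decomposition (JDK Normalizer here, not ICU4J).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        // 2. Drop every code point in the Unicode "Mark" category.
        String noMarks = decomposed.replaceAll("\\p{M}+", "");
        // 3. Locale-independent lower-casing.
        String lower = noMarks.toLowerCase(Locale.ROOT);
        // 4. Fold remaining lookalikes through the (hypothetical) table.
        StringBuilder out = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            out.append(mapLookalike(lower.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00C9cole"));            // composed E-acute
        System.out.println(fold("e\u0301cole"));           // decomposed e + U+0301
        System.out.println(fold("\u0141\u00F3d\u017A"));   // L-stroke, o-acute, d, z-acute
    }
}
```

Both spellings of the accented word fold to the same string, and the 
L-with-stroke only folds because of the final table lookup, since 
normalization alone leaves it untouched.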

-- Ken

> A replacement for ISOLatin1AccentFilter that does a more thorough job of 
> removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, 
> UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java
>
>
> The ISOLatin1AccentFilter takes Unicode characters that have diacritical 
> marks and replaces them with a version of that character with the diacritical 
> mark removed. For example, é becomes e. However, another equally valid way of 
> representing an accented character in Unicode is to have the unaccented 
> character followed by a non-spacing modifier character (for example, e 
> followed by U+0301 COMBINING ACUTE ACCENT). The ISOLatin1AccentFilter doesn't 
> handle the accents in decomposed Unicode characters at all. Additionally, 
> there are cases where a word contains what looks like an accented character 
> but is actually considered a separate, unaccented character, such as Ł, 
> which you want to fold onto its Latin-1 lookalike, L, to make searching 
> easier.
>
> The UnicodeNormalizationFilter can filter out accents and diacritical marks 
> whether they occur as composed or decomposed characters. It can also handle 
> the cases described above, where characters that look like they have 
> diacritics (but don't) are folded onto the letter they resemble (Ł -> L).
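
The composed/decomposed distinction in the description above can be checked 
with a few lines of Java. This is only an illustration using the JDK's 
java.text.Normalizer; the class name DecompositionCheck and the nfd() helper 
are made up here and are not part of the attached filter.

```java
import java.text.Normalizer;

public class DecompositionCheck {

    // Canonical decomposition (NFD) of a string.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // single code point, e with acute
        String decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT
        String lStroke = "\u0141";     // LATIN CAPITAL LETTER L WITH STROKE

        // The two spellings differ as raw strings...
        System.out.println(composed.equals(decomposed));
        // ...but agree once the composed form is decomposed.
        System.out.println(nfd(composed).equals(decomposed));
        // L with stroke has no decomposition, so NFD returns it unchanged.
        System.out.println(nfd(lStroke).equals(lStroke));
    }
}
```

This is why stripping combining marks after decomposition covers é in either 
form, while Ł needs a separate lookalike mapping.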

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
