[jira] Commented: (LUCENE-1343) A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.

Robert Muir (JIRA) Mon, 07 Dec 2009 08:19:54 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786971#action_12786971
 ]


Robert Muir commented on LUCENE-1343:
-------------------------------------

{quote}
Yes, I'm referring to ancient Greek (grc, not el) and they are tone and 
breathing marks. Most ancient texts did not have these marks but modern do. 
Even some modern representations of the ancient. While I have several semesters 
of koine Greek under my belt and might be wrong, there may be ambiguities where 
two words have the same letters but differ on marks, but they are infrequent (I 
don't know of any).
{quote}

I guess I brought this up because this is where you have several situations 
in which case folding and normalization interact, e.g. applying the FC_NFKC 
set when case folding so that the result stays closed under later NFK[CD] 
normalization. I know this is supposed to handle the various ways the 
YPOGEGRAMMENI can be implemented, but I forget the details...

This is why I think the general-purpose contribution should be case folding, 
normalization, and things like the FC_NFKC set that make sure they work 
together...
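
To make the interaction concrete, here is a minimal sketch using only the JDK. 
It is an illustration, not the FC_NFKC mechanism itself: 
toLowerCase(Locale.ROOT) stands in for full Unicode case folding (the two are 
not identical), and the class and method names are made up for this example. 
It shows that folding first and normalizing afterwards is not closed, because 
NFKC can introduce new uppercase letters, and that repeating the fold after 
normalization repairs this particular case.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch class; not part of Lucene or of the attached patch.
class FoldNormalizeSketch {

    // Approximation of case folding: the JDK has no direct case-fold call,
    // so lowercasing with Locale.ROOT is used as a stand-in.
    static String fold(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Fold, then normalize. NFKC may expand compatibility characters into
    // cased letters that the earlier fold never saw.
    static String foldOnce(String s) {
        return nfkc(fold(s));
    }

    // Fold and normalize, then fold and normalize again, so that letters
    // introduced by NFKC are folded as well.
    static String foldClosed(String s) {
        return nfkc(fold(nfkc(fold(s))));
    }

    public static void main(String[] args) {
        String tradeMark = "\u2122"; // TRADE MARK SIGN, NFKC expands it to "TM"
        System.out.println(foldOnce(tradeMark));   // prints "TM"
        System.out.println(foldClosed(tradeMark)); // prints "tm"
    }
}
```

The second pass is the brute-force way to get closure. As I understand it, the 
FC_NFKC data exists so the extra mappings can be applied during folding 
instead.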

If you later want to apply something more specialized like StringPrep, you need 
this logic anyway, see http://www.ietf.org/rfc/rfc3454.txt (especially section 
3.2) 


> A replacement for ISOLatin1AccentFilter that does a more thorough job of 
> removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java, 
> UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java
>
>
> The ISOLatin1AccentFilter takes Unicode characters that have diacritical 
> marks and replaces them with a version of that character with the diacritical 
> mark removed. For example, é becomes e. However, another equally valid way of 
> representing an accented character in Unicode is to have the unaccented 
> character followed by a non-spacing modifier character (like this: é). 
> The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode 
> characters at all. Additionally, there are some instances where a word will 
> contain what looks like an accented character but is actually considered to 
> be a separate unaccented character, such as Ł, which you want to fold onto 
> its Latin-1 lookalike L to make searching easier. 
> The UnicodeNormalizationFilter can filter out accents and diacritical marks 
> whether they occur as composed or decomposed characters. It can also handle 
> the cases described above, where characters that look like they have 
> diacritics (but don't) are to be folded onto the letter that they look 
> like (Ł -> L).
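
The behaviour described in the issue can be sketched with the JDK alone. This 
is not the code from the attached UnicodeNormalizationFilter; the class name, 
method names, and the two-entry lookalike table are assumptions made for 
illustration. The idea is to decompose to NFD, drop the non-spacing marks 
(general category Mn), and then apply an explicit mapping for letters such as 
Ł that have no decomposition.

```java
import java.text.Normalizer;

// Hypothetical sketch class; the real filter in the patch may work differently.
class AccentStripSketch {

    // Decompose, then remove non-spacing marks. Handles both the composed
    // form (U+00E9) and the decomposed form (e followed by U+0301).
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    // Illustrative lookalike table: L WITH STROKE has no canonical
    // decomposition, so it needs an explicit entry.
    static char lookalike(char c) {
        switch (c) {
            case '\u0141': return 'L'; // LATIN CAPITAL LETTER L WITH STROKE
            case '\u0142': return 'l'; // LATIN SMALL LETTER L WITH STROKE
            default:       return c;
        }
    }

    // Strip marks first, then fold lookalike letters.
    static String fold(String s) {
        String stripped = stripMarks(s);
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            out.append(lookalike(stripped.charAt(i)));
        }
        return out.toString();
    }
}
```

With this sketch, fold("\u00E9") and fold("e\u0301") both give "e", while 
stripMarks alone leaves "\u0141" unchanged and fold maps it to "L".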

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
