[ https://issues.apache.org/jira/browse/LUCENE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786968#action_12786968 ]
DM Smith commented on LUCENE-1343:
----------------------------------

{quote}
bq. Robert Muir, would it make sense to have a Greek filter that strips diacritics? My thought is that if the letter is Greek then the diacritics would be removed, but otherwise they would not.

The GreekLowerCaseFilter (incorrectly named) does this also, somewhat. It removes tone marks... but this might not be what you "want" (depending on what that is), if you are dealing with polytonic Greek (sorry for my ignorance of the biblical text you are looking at, but I think it is ancient Greek?)
{quote}

Yes, I'm referring to ancient Greek (grc, not el), and the marks are tone and breathing marks. Most ancient texts did not have these marks, but modern texts do, as do some modern representations of the ancient ones. I have several semesters of Koine Greek under my belt and might be wrong, but while there may be ambiguities where two words have the same letters and differ only in their marks, they are infrequent (I don't know of any).

The GreekLowerCaseFilter appears to do only some of the work, and it only works on composed characters.

My question is not whether I'd find the filter useful, but whether it'd be a useful addition to Lucene.

{quote}
bq. Similar question for Hebrew, I see value in two filters: one would strip cantillation and the other, vowel points. Or would it be better to have one that can do both depending on flags?

This depends on your use case, and then you have dagesh, shin dot, too... These are all NSMs.
{quote}

I have a terrible habit of not being exact or using the proper terms. Shame on me. I meant that the latter would strip all other marks.

bq. But this is going to depend on the user, and I think every person will need their own, they can use CharFilter or other ways of defining these tables.

If there is no general purpose contribution, then it should not be part of Lucene and I'll have my own. When I do work them up, I'll create an issue or two and attach the results.
If they are deemed useful then they can be added to Lucene; otherwise they can be ignored.

> A replacement for ISOLatin1AccentFilter that does a more thorough job of
> removing diacritical marks or non-spacing modifiers.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1343
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1343
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Haschart
>            Priority: Minor
>         Attachments: normalizer.jar, UnicodeCharUtil.java,
>                      UnicodeNormalizationFilter.java, UnicodeNormalizationFilterFactory.java
>
>
> The ISOLatin1AccentFilter takes Unicode characters that have diacritical
> marks and replaces them with a version of that character with the diacritical
> mark removed. For example, é becomes e. However, another equally valid way of
> representing an accented character in Unicode is to have the unaccented
> character followed by a non-spacing modifier character (like this: é ).
> The ISOLatin1AccentFilter doesn't handle the accents in decomposed Unicode
> characters at all. Additionally, there are some instances where a word will
> contain what looks like an accented character but is actually considered to
> be a separate unaccented character, such as Ł, which you may still want to
> fold onto its Latin-1 lookalike L to make searching easier.
> The UnicodeNormalizationFilter can filter out accents and diacritical marks
> whether they occur as composed characters or decomposed characters. It can
> also handle the cases described above, where characters that look like they
> have diacritics (but don't) are to be folded onto the letter that they look
> like ( Ł -> L ).
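
To make the ideas in this thread concrete, here is a minimal sketch in plain Java of the three behaviours discussed: stripping marks from both composed and decomposed input, stripping marks only when the base letter is Greek, and a Hebrew stripper controlled by two flags. It is not the attached UnicodeNormalizationFilter and not any existing Lucene class. The class name DiacriticFolding, its method names, the two-entry lookalike table, and the Hebrew code point ranges are all assumptions made for illustration; it relies only on java.text.Normalizer and java.lang.Character.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; none of these names come from Lucene.
final class DiacriticFolding {

    private DiacriticFolding() {}

    private static boolean isMark(int cp) {
        return Character.getType(cp) == Character.NON_SPACING_MARK;
    }

    // Assumed lookalike table: only the pair mentioned in the issue (L with stroke).
    // These letters have no Unicode decomposition, so they need an explicit mapping.
    private static int foldLookalike(int cp) {
        switch (cp) {
            case 0x0141: return 'L';
            case 0x0142: return 'l';
            default: return cp;
        }
    }

    // Decompose to NFD, drop every non-spacing mark, fold the lookalikes.
    static String stripAllMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (!isMark(cp)) {
                out.appendCodePoint(foldLookalike(cp));
            }
        }
        return out.toString();
    }

    // Drop marks only when the preceding base character is in the Greek script.
    // Other text is recomposed to NFC at the end, so decomposed non-Greek input
    // comes back in composed form.
    static String stripGreekMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        boolean afterGreekBase = false;
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (isMark(cp)) {
                if (afterGreekBase) {
                    continue; // tone, breathing, iota subscript, dialytika, ...
                }
            } else {
                afterGreekBase =
                    Character.UnicodeScript.of(cp) == Character.UnicodeScript.GREEK;
            }
            out.appendCodePoint(cp);
        }
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFC);
    }

    // One method, two flags. Assumed ranges: U+0591..U+05AF is treated as
    // cantillation; every non-spacing mark in U+05B0..U+05C7 (vowel points,
    // dagesh, shin/sin dot, ...) is treated as "other marks".
    static String stripHebrewMarks(String input, boolean cantillation, boolean otherMarks) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            boolean isCantillation = cp >= 0x0591 && cp <= 0x05AF;
            boolean isOtherMark = cp >= 0x05B0 && cp <= 0x05C7 && isMark(cp);
            if ((cantillation && isCantillation) || (otherMarks && isOtherMark)) {
                continue;
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }
}
```

In a real analysis chain this logic would presumably sit inside a TokenFilter or CharFilter; that wiring is left out here because it depends on the Lucene version in use. Whether the Greek and Hebrew variants are general enough to ship, or belong in per-user tables as suggested above, is the open question of the thread.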