[jira] Commented: (LANG-285) Wish : method unaccent

JIRA Tue, 24 Aug 2010 10:45:42 -0700

    [ 
https://issues.apache.org/jira/browse/LANG-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901983#action_12901983
 ]


Cédrik LIME commented on LANG-285:
----------------------------------

A few remarks on the current code state in svn:
* all compiled patterns must absolutely move out of methods and into private 
static final fields. Compiling a Pattern on every method invocation is a big 
no-no for performance.
In our case, the culprit is
{{java.util.regex.Pattern accentPattern = 
java.util.regex.Pattern.compile("\\p{InCombiningDiacriticalMarks}+");}}
* {{stripAccents(String)}} should probably accept a {{CharSequence}} instead 
of a {{String}}, as {{java.text.Normalizer.normalize()}} accepts a 
{{CharSequence}}
* the same {{stripAccents(String)}} could probably be enhanced to use 
{{sun.text.Normalizer.decompose(text, false, 0)}} under Java 5 (also using 
reflection, as this class is not available in Java 6)
* You may be very interested in LUCENE-1390 and LUCENE-1343, which solved the 
exact same problem (classes {{ASCIIFoldingFilter}} and 
{{UnicodeNormalizationFilter}}). They concluded, after a lot of thought, 
that a gigantic map-like approach was the best for both features and 
performance (Unicode decomposition can lead to unexpected results for some 
inputs such as Ł; what we really want is ASCII folding logic, which is a 
_superset_ of Unicode decomposition).
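
A minimal sketch of the first two points, assuming Java 6 or later (where 
{{java.text.Normalizer}} is available). The class name {{StripAccentsSketch}} 
and the field name {{ACCENT_PATTERN}} are made up for illustration and are not 
what is in svn:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical holder class, for illustration only.
final class StripAccentsSketch {

    // Compiled once when the class is loaded; Pattern is immutable and
    // safe to share between threads.
    private static final Pattern ACCENT_PATTERN =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    private StripAccentsSketch() {
    }

    // Accepts any CharSequence, since Normalizer.normalize() does too.
    static String stripAccents(CharSequence input) {
        if (input == null) {
            return null;
        }
        // NFD splits each accented letter into a base letter followed by
        // its combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop the combining marks and keep the base letters.
        return ACCENT_PATTERN.matcher(decomposed).replaceAll("");
    }
}
```

Characters without a canonical decomposition, such as Ł, pass through this 
version unchanged.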

I could have a go at the first three points as soon as my work schedule leaves 
me a bit of free time.
For the last point, I think further discussion would be welcome: what exactly 
are we trying to achieve with the {{stripAccents(String)}} method?
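
To make the question concrete, here is a rough sketch of what folding on top of 
decomposition could look like. The class {{AsciiFoldingSketch}} and its tiny 
table are hypothetical: the entries are a handful of examples I picked, not the 
Lucene table, and a real table would be far larger:

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, for illustration only.
final class AsciiFoldingSketch {

    // Illustrative entries for characters that have no canonical decomposition.
    private static final Map<Character, String> FOLDINGS =
            new HashMap<Character, String>();
    static {
        FOLDINGS.put('\u0141', "L");  // Ł
        FOLDINGS.put('\u0142', "l");  // ł
        FOLDINGS.put('\u00D8', "O");  // Ø
        FOLDINGS.put('\u00F8', "o");  // ø
        FOLDINGS.put('\u00DF', "ss"); // ß
        FOLDINGS.put('\u00C6', "AE"); // Æ
    }

    private AsciiFoldingSketch() {
    }

    static String fold(CharSequence input) {
        if (input == null) {
            return null;
        }
        // Decompose first so that ordinary accents become combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            String mapped = FOLDINGS.get(c);
            if (mapped != null) {
                // Table lookup handles what decomposition cannot.
                out.append(mapped);
            } else if (Character.UnicodeBlock.of(c)
                    != Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS) {
                // Keep everything except the combining marks.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, "Łódź" comes out as "Lodz", whereas decomposition 
alone would leave the Ł in place.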

> Wish : method unaccent
> ----------------------
>
>                 Key: LANG-285
>                 URL: https://issues.apache.org/jira/browse/LANG-285
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Guillaume Coté
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LANG-285-unaccent-using-Collator.patch, LANG-285.patch, 
> MapBuilder.java, unaccent.patch, UnnacentMap.java
>
>
> I would like to add a method that replaces accented characters with 
> unaccented ones.  For example, with the input String "L'été où j'ai dû aller 
> à l'île d'Anticosti commenca tôt", the method would return "L'ete ou j'ai du 
> aller a l'ile d'Anticosti commenca tot".
> I suggest calling that method unaccent and adding it to StringUtils.
> If we cannot cover all cases, the first version could cover only ISO-8859-1.
> If you are willing to go forward with that idea, I am willing to contribute a 
> patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
