[ 
https://issues.apache.org/jira/browse/CODEC-174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829991#comment-13829991
 ] 

Thomas Neidhart commented on CODEC-174:
---------------------------------------

The semantics have not changed:

 * Before, a new set was created and used to initialize a new 
PhonemeBuilder.
 * Now, the existing set of the PhonemeBuilder is reset with the values 
collected in the loop.

Thus duplicates are not present in the result, although I have to admit that 
the maxPhonemes parameter might no longer work as expected, as it is now 
applied to the list (which might contain duplicates).
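To illustrate the maxPhonemes concern, here is a minimal sketch. It is not 
the commons-codec code: the class and method names are hypothetical, and 
plain strings stand in for phonemes. It only shows how a cap applied to a 
list that may contain duplicates can leave fewer unique entries than a cap 
applied to a set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the PhonemeBuilder implementation.
public class MaxPhonemesSketch {

    // Cap is checked against a list, duplicates are removed afterwards.
    static Set<String> limitThenDedupe(List<String> candidates, int maxPhonemes) {
        List<String> collected = new ArrayList<String>();
        for (String candidate : candidates) {
            if (collected.size() >= maxPhonemes) {
                break;
            }
            collected.add(candidate); // duplicates also count towards the cap
        }
        return new LinkedHashSet<String>(collected);
    }

    // Cap is checked against the set, so only unique entries count.
    static Set<String> dedupeThenLimit(List<String> candidates, int maxPhonemes) {
        Set<String> unique = new LinkedHashSet<String>();
        for (String candidate : candidates) {
            if (unique.size() >= maxPhonemes) {
                break;
            }
            unique.add(candidate);
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("a", "a", "b", "c");
        System.out.println(limitThenDedupe(candidates, 2)); // prints [a]
        System.out.println(dedupeThenLimit(candidates, 2)); // prints [a, b]
    }
}
```

With a cap of 2, the list-based variant stops after the two "a" entries and 
ends up with a single unique value, while the set-based variant keeps two.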

By the way, while testing earlier, I noticed cases where the maxPhonemes 
setting is not respected: when the final rules are applied, all phonemes 
found are added to the final result without taking the maximum allowed value 
into account.

> Improve performance of Beider Morse encoder
> -------------------------------------------
>
>                 Key: CODEC-174
>                 URL: https://issues.apache.org/jira/browse/CODEC-174
>             Project: Commons Codec
>          Issue Type: Improvement
>    Affects Versions: 1.6, 1.7
>            Reporter: Thomas Champagne
>              Labels: patch, performance
>         Attachments: CODEC-174-change-rules-storage-to-Map.patch, 
> CODEC-174-convert-set-to-list-in-apply-method.patch, 
> CODEC-174-delete-subsequence-cache-and-use-String.patch, 
> CODEC-174-delete-subsequence-cache.patch, 
> CODEC-174-refactor-join-method-in-Phoneme.patch, 
> CODEC-174-refactor-restrictTo-method-in-SomeLanguages.patch, 
> CODEC-174-reuse-set-in-PhonemeBuilder.patch, CODEC_174_cleanup.patch, 
> TestCacheSubSequence.java, test-commons-codec-test-bm.zip
>
>
> I use the Beider Morse encoder with Solr. When it indexes a lot of documents 
> using this encoder, the import time is multiplied by 30. So I have decided 
> to optimize the current implementation in commons-codec.
> So far I have created two patches. The first patch deletes a "performance 
> hack", a subsequence cache. This cache does not improve performance, and 
> deleting it saves a few milliseconds.
> The second patch changes the in-memory storage of the rules to a Map 
> instead of a List. With it, you can access a rule directly by the 
> beginning of its pattern. This patch divides the encoding time by 2.
> I will try to find more improvements. If you have any ideas, please tell me.
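
The Map-based rule storage described in the issue could look roughly like 
the sketch below. This is an assumption-laden illustration, not the attached 
patch: the Rule class, its fields, and the choice of the first character of 
the pattern as the key are all hypothetical, and patterns are assumed to be 
non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the commons-codec Rule class.
public class RuleIndexSketch {

    static final class Rule {
        final String pattern;
        final String phoneme;

        Rule(String pattern, String phoneme) {
            this.pattern = pattern;
            this.phoneme = phoneme;
        }
    }

    // Group rules by the first character of their pattern (assumed key).
    static Map<String, List<Rule>> indexByFirstChar(List<Rule> rules) {
        Map<String, List<Rule>> index = new HashMap<String, List<Rule>>();
        for (Rule rule : rules) {
            String key = rule.pattern.substring(0, 1);
            List<Rule> bucket = index.get(key);
            if (bucket == null) {
                bucket = new ArrayList<Rule>();
                index.put(key, bucket);
            }
            bucket.add(rule);
        }
        return index;
    }

    // Only rules whose pattern starts with the character at pos are returned,
    // instead of scanning the full rule list.
    static List<Rule> candidatesAt(Map<String, List<Rule>> index, String input, int pos) {
        List<Rule> bucket = index.get(input.substring(pos, pos + 1));
        return bucket == null ? Collections.<Rule>emptyList() : bucket;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("sch", "S"), new Rule("s", "s"), new Rule("t", "t"));
        Map<String, List<Rule>> index = indexByFirstChar(rules);
        System.out.println(candidatesAt(index, "stone", 0).size()); // prints 2
    }
}
```

Each candidate still has to be matched against the input in full; the index 
only narrows down which rules need to be tried at a given position.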



--
This message was sent by Atlassian JIRA
(v6.1#6144)
