[jira] Updated: (CODEC-107) Enhance documentation for Language Encoders

Marc Pompl (JIRA) Fri, 10 Dec 2010 16:27:25 -0800

     [ https://issues.apache.org/jira/browse/CODEC-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Marc Pompl updated CODEC-107:
-----------------------------

    Description: 
The current userguide (http://commons.apache.org/codec/userguide.html) lists 
only four Language Encoders, but there are currently five. CODEC-106 
implements a sixth one.
It would be a good idea to complete the documentation.

Additionally, I suggest extending the userguide to show a simple 
performance measurement:

_SNAP_

Metaphone encodings per sec: 32258
DoubleMetaphone encodings per sec: 31250
Soundex encodings per sec: 35714
RefinedSoundex encodings per sec: 34482
Caverphone encodings per sec: 5813
ColognePhonetic encodings per sec: 33333

So, Caverphone is much slower than any other algorithm (roughly a sixth of 
their throughput). All others show nearly the same performance.

Checked with the following code:

{code:java}
  public void checkSpeed() throws Exception {
          checkSpeedEncoding("Metaphone", "easgasg", "ESKS");
          checkSpeedEncoding("DoubleMetaphone", "easgasg", "ASKS");
          checkSpeedEncoding("Soundex", "easgasg", "E220");
          checkSpeedEncoding("RefinedSoundex", "easgasg", "E034034");
          checkSpeedEncoding("Caverphone", "Carlene", "KLN1111111");
          checkSpeedEncoding("ColognePhonetic", "Schmitt", "862");
  }
  
  private void checkSpeedEncoding(String encoder, String toBeEncoded, String estimated) throws Exception {
          long start = System.currentTimeMillis();
          for (int i = 0; i < REPEATS; i++) {
                    assertAlgorithm(encoder, "false", toBeEncoded,
                            new String[] { estimated });
          }
          // Clamp to 1 ms so a very fast run cannot cause a division by zero.
          long duration = Math.max(1, System.currentTimeMillis() - start);
          // Multiply before dividing so integer division does not truncate the elapsed seconds.
          System.out.println(encoder + " encodings per sec: " + (REPEATS * 1000L / duration));
  }
{code}

_SNAP_
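
The measurement idea above can also be sketched without the test harness. This is only a rough illustration, not the code used for the numbers in this issue: the class name `EncoderThroughput`, the helpers `perSecond` and `measure`, and the upper-casing stand-in encoder are all assumptions made for the example. It times with `System.nanoTime()`, runs one warm-up pass, and computes the rate in floating point.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper class; not part of Commons Codec.
final class EncoderThroughput {

    // Converts a count and an elapsed time in nanoseconds to operations per second.
    static double perSecond(long count, long elapsedNanos) {
        // Clamp to 1 ns so a zero reading cannot cause a division by zero.
        long safeNanos = Math.max(1L, elapsedNanos);
        return count * 1_000_000_000.0 / safeNanos;
    }

    // Applies the encoder to the same input `repeats` times and returns encodings per second.
    static double measure(UnaryOperator<String> encoder, String input, int repeats) {
        long sink = 0; // accumulate result lengths so the loop body is not dead code
        long start = System.nanoTime();
        for (int i = 0; i < repeats; i++) {
            sink += encoder.apply(input).length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            throw new IllegalStateException("unreachable: lengths are non-negative");
        }
        return perSecond(repeats, elapsed);
    }

    public static void main(String[] args) {
        // Stand-in for a real phonetic encoder (assumption for this sketch).
        UnaryOperator<String> standIn = s -> s.toUpperCase();
        int repeats = 100_000;
        measure(standIn, "easgasg", repeats); // warm-up pass, result discarded
        double rate = measure(standIn, "easgasg", repeats);
        System.out.println("StandIn encodings per sec: " + Math.round(rate));
    }
}
```

To time a real encoder, the stand-in could presumably be replaced by something like `new Soundex()::encode`, with Commons Codec on the classpath. A single timed loop like this remains a coarse measurement, so the figures should be read as relative, not absolute.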

  was:
The current userguide (http://commons.apache.org/codec/userguide.html) lists 
only four Language Encoders, but there are currently five. CODEC-106 
implements a sixth one.
It would be a good idea to complete the documentation.

Additionally, I suggest extending the wiki 
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory)
 to show a simple performance measurement:

_SNAP_

Metaphone encodings per sec: 32258
DoubleMetaphone encodings per sec: 31250
Soundex encodings per sec: 35714
RefinedSoundex encodings per sec: 34482
Caverphone encodings per sec: 5813
ColognePhonetic encodings per sec: 33333

So, Caverphone is much slower than any other algorithm (roughly a sixth of 
their throughput). All others show nearly the same performance.

Checked with the following code:

{code:java}
  public void checkSpeed() throws Exception {
          checkSpeedEncoding("Metaphone", "easgasg", "ESKS");
          checkSpeedEncoding("DoubleMetaphone", "easgasg", "ASKS");
          checkSpeedEncoding("Soundex", "easgasg", "E220");
          checkSpeedEncoding("RefinedSoundex", "easgasg", "E034034");
          checkSpeedEncoding("Caverphone", "Carlene", "KLN1111111");
          checkSpeedEncoding("ColognePhonetic", "Schmitt", "862");
  }
  
  private void checkSpeedEncoding(String encoder, String toBeEncoded, String estimated) throws Exception {
          long start = System.currentTimeMillis();
          for (int i = 0; i < REPEATS; i++) {
                    assertAlgorithm(encoder, "false", toBeEncoded,
                            new String[] { estimated });
          }
          // Clamp to 1 ms so a very fast run cannot cause a division by zero.
          long duration = Math.max(1, System.currentTimeMillis() - start);
          // Multiply before dividing so integer division does not truncate the elapsed seconds.
          System.out.println(encoder + " encodings per sec: " + (REPEATS * 1000L / duration));
  }
{code}

_SNAP_


> Enhance documentation for Language Encoders
> -------------------------------------------
>
>                 Key: CODEC-107
>                 URL: https://issues.apache.org/jira/browse/CODEC-107
>             Project: Commons Codec
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Marc Pompl
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The current userguide (http://commons.apache.org/codec/userguide.html) lists 
> only four Language Encoders, but there are currently five. CODEC-106 
> implements a sixth one.
> It would be a good idea to complete the documentation.
> Additionally, I suggest extending the userguide to show a simple 
> performance measurement:
> _SNAP_
> Metaphone encodings per sec: 32258
> DoubleMetaphone encodings per sec: 31250
> Soundex encodings per sec: 35714
> RefinedSoundex encodings per sec: 34482
> Caverphone encodings per sec: 5813
> ColognePhonetic encodings per sec: 33333
> So, Caverphone is much slower than any other algorithm (roughly a sixth of 
> their throughput). All others show nearly the same performance.
> Checked with the following code:
> {code:java}
>   public void checkSpeed() throws Exception {
>         checkSpeedEncoding("Metaphone", "easgasg", "ESKS");
>         checkSpeedEncoding("DoubleMetaphone", "easgasg", "ASKS");
>         checkSpeedEncoding("Soundex", "easgasg", "E220");
>         checkSpeedEncoding("RefinedSoundex", "easgasg", "E034034");
>         checkSpeedEncoding("Caverphone", "Carlene", "KLN1111111");
>         checkSpeedEncoding("ColognePhonetic", "Schmitt", "862");
>   }
>   
>   private void checkSpeedEncoding(String encoder, String toBeEncoded, String estimated) throws Exception {
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < REPEATS; i++) {
>                   assertAlgorithm(encoder, "false", toBeEncoded,
>                           new String[] { estimated });
>         }
>         // Clamp to 1 ms so a very fast run cannot cause a division by zero.
>         long duration = Math.max(1, System.currentTimeMillis() - start);
>         // Multiply before dividing so integer division does not truncate the elapsed seconds.
>         System.out.println(encoder + " encodings per sec: " + (REPEATS * 1000L / duration));
>   }
> {code}
> _SNAP_

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
