Poor performance of Hunspell with Polish Dictionary
---------------------------------------------------

                 Key: SOLR-3245
                 URL: https://issues.apache.org/jira/browse/SOLR-3245
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis
    Affects Versions: 4.0
         Environment: Centos 6.2, kernel 2.6.32, 2 physical CPU Xeon 5606 (4 
cores each), 32 GB RAM, 2 SSD disks in RAID 0, java version 1.6.0_26, java 
settings -server -Xms4096M -Xmx4096M 
            Reporter: Agnieszka


In Solr 4.0, the Hunspell stemmer with a Polish dictionary performs poorly, 
whereas the Hunspell implementation from 
http://code.google.com/p/lucene-hunspell/ used with Solr 3.4 performs very well.

Tests show:

Solr 3.4, full import 489017 documents:

StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec 
HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

Solr 4.0, full import 489017 documents:

StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec 
HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
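
The throughput figures above follow directly from the document count and the import times. As a rough cross-check, the sketch below recomputes docs/sec and the 3.4 to 4.0 slowdown per filter factory. The numbers are copied from the report; the helper names (`docs_per_second`, `slowdown`) are illustrative only and are not part of Solr.

```python
# Figures copied from the report: (Solr version, filter factory) -> full-import time in seconds.
TOTAL_DOCS = 489017

import_seconds = {
    ("3.4", "StempelPolishStemFilterFactory"): 2908,
    ("3.4", "HunspellStemFilterFactory"): 3922,
    ("4.0", "StempelPolishStemFilterFactory"): 3016,
    ("4.0", "HunspellStemFilterFactory"): 44580,
}


def docs_per_second(seconds, docs=TOTAL_DOCS):
    """Average indexing throughput over the whole import."""
    return docs / seconds


def slowdown(factory):
    """How many times longer the Solr 4.0 import took than the Solr 3.4 import."""
    return import_seconds[("4.0", factory)] / import_seconds[("3.4", factory)]


for (version, factory), seconds in import_seconds.items():
    print(f"Solr {version} {factory}: {docs_per_second(seconds):.0f} docs/sec")

for factory in ("StempelPolishStemFilterFactory", "HunspellStemFilterFactory"):
    print(f"{factory}: {slowdown(factory):.2f}x the 3.4 import time")
```

With these inputs, the Stempel import time is almost unchanged between versions (about 1.04x), while the Hunspell import takes roughly 11.4x as long in 4.0.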

My schema is quite simple. For Hunspell I have one text field that 14 other 
text fields are copied to:
{code:xml}
<field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>

<copyField source="field1" dest="text"/>  
....
<copyField source="field14" dest="text"/>
{code}


The "text_pl_hunspell" configuration:

{code:xml}
<fieldType name="text_pl_hunspell" class="solr.TextField" 
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory" 
dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
        <!--filter class="solr.KeywordMarkerFilterFactory" 
protected="protwords_pl.txt"/-->
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" 
synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory" 
dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
        <filter class="solr.KeywordMarkerFilterFactory" 
protected="dict/protwords_pl.txt"/>
      </analyzer>
    </fieldType>
{code}

I use a Polish dictionary (pl_PL.dic, pl_PL.aff); the files stopwords_pl.txt, 
protwords_pl.txt, and synonyms_pl.txt are empty. These are the same files I 
used with version 3.4. 

For the Stempel Polish stemmer, the only difference is in the definition of the text field:
{code:xml}
<field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>

    <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StempelPolishStemFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" 
protected="dict/protwords_pl.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" 
synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="dict/stopwords_pl.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StempelPolishStemFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" 
protected="dict/protwords_pl.txt"/>
      </analyzer>
    </fieldType>
{code}

One document has 23 fields:
- 14 text fields copied to one text field (above) that is only indexed
- 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)

The size of one document is 3-4 kB.
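
To put the document size in perspective, here is a back-of-the-envelope estimate of the total corpus size and the raw data rate during the Solr 4.0 Hunspell import. It is only a sketch: it assumes 1 kB = 1000 bytes and that every document falls within the reported 3-4 kB range; the function names are hypothetical.

```python
# Inputs taken from the report; the 1 kB = 1000 bytes convention is an assumption.
TOTAL_DOCS = 489017
DOC_SIZE_KB_RANGE = (3, 4)       # reported size of one document
HUNSPELL_40_SECONDS = 44580      # Solr 4.0 full import with HunspellStemFilterFactory
BYTES_PER_KB = 1000


def corpus_size_gb(doc_size_kb):
    """Approximate total corpus size in GB if every document had this size."""
    return TOTAL_DOCS * doc_size_kb * BYTES_PER_KB / 1e9


def data_rate_kb_per_sec(doc_size_kb, seconds=HUNSPELL_40_SECONDS):
    """Approximate raw input rate in kB/sec over the whole import."""
    return TOTAL_DOCS * doc_size_kb / seconds


low_kb, high_kb = DOC_SIZE_KB_RANGE
print(f"corpus size: {corpus_size_gb(low_kb):.2f}-{corpus_size_gb(high_kb):.2f} GB")
print(f"data rate:   {data_rate_kb_per_sec(low_kb):.0f}-{data_rate_kb_per_sec(high_kb):.0f} kB/sec")
```

Under these assumptions the corpus is about 1.5-2 GB, and the Hunspell import in 4.0 processes only a few tens of kB per second.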




