[jira] [Commented] (SOLR-3245) Poor performance of Hunspell with Polish Dictionary

Ales Perme (JIRA) Tue, 17 Jul 2012 06:47:38 -0700

    [ https://issues.apache.org/jira/browse/SOLR-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416187#comment-13416187 ]


Ales Perme commented on SOLR-3245:
----------------------------------

Hi! I have the same problem with the Slovenian dictionary in Solr 3.6. 
Performance comparisons:

Solr 3.1 + Hunspell: indexing speed 285 documents/s
Solr 3.6 + Hunspell: indexing speed 23 documents/s
Solr 3.6 without Hunspell: indexing speed 110 documents/s

Weird...
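
To put the three figures in proportion, the ratios can be worked out with a few lines of Python. This is only an arithmetic sketch over the numbers reported in this comment; the `rates` dictionary and the `slowdown` helper are illustrative names and have nothing to do with the Solr API.

```python
# Indexing throughput reported in this comment (documents per second).
rates = {
    "3.1 + Hunspell": 285,
    "3.6 + Hunspell": 23,
    "3.6 without Hunspell": 110,
}


def slowdown(baseline: float, observed: float) -> float:
    """Return how many times slower `observed` is than `baseline`."""
    return baseline / observed


# Hypothetical variable names, used only for this illustration.
vs_31 = slowdown(rates["3.1 + Hunspell"], rates["3.6 + Hunspell"])
vs_plain = slowdown(rates["3.6 without Hunspell"], rates["3.6 + Hunspell"])

print(f"3.6 + Hunspell is {vs_31:.1f}x slower than 3.1 + Hunspell")
print(f"3.6 + Hunspell is {vs_plain:.1f}x slower than 3.6 without Hunspell")
```

The ratios only restate the measurements; they say nothing about where in the analysis chain the time is spent.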

SCHEMA:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" 
splitOnCaseChange="1" stemEnglishPossessive="0" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" 
enablePositionIncrements="true"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.HunspellStemFilterFactory" 
dictionary="dictionaries/sl_SI.dic" affix="dictionaries/sl_SI.aff" 
ignoreCase="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" 
splitOnCaseChange="1" stemEnglishPossessive="0" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" 
enablePositionIncrements="true"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.HunspellStemFilterFactory" 
dictionary="dictionaries/sl_SI.dic" affix="dictionaries/sl_SI.aff" 
ignoreCase="true"/>
</analyzer>
</fieldType>
                
> Poor performance of Hunspell with Polish Dictionary
> ---------------------------------------------------
>
>                 Key: SOLR-3245
>                 URL: https://issues.apache.org/jira/browse/SOLR-3245
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 4.0-ALPHA
>         Environment: Centos 6.2, kernel 2.6.32, 2 physical CPU Xeon 5606 (4 
> cores each), 32 GB RAM, 2 SSD disks in RAID 0, java version 1.6.0_26, java 
> settings -server -Xms4096M -Xmx4096M 
>            Reporter: Agnieszka
>              Labels: performance
>         Attachments: pl_PL.zip
>
>
> In Solr 4.0 the Hunspell stemmer with the Polish dictionary performs poorly, 
> whereas Hunspell from 
> http://code.google.com/p/lucene-hunspell/ in Solr 3.4 performs very well. 
> Tests show:
> Solr 3.4, full import 489017 documents:
> StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec 
> HunspellStemFilterFactory - 3922 seconds, 125 docs/sec
> Solr 4.0, full import 489017 documents:
> StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec 
> HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
> My schema is quite simple. For Hunspell I have one text field that 14 text 
> fields are copied to:
> {code:xml}
> <field name="text" type="text_pl_hunspell" indexed="true" stored="false" 
> multiValued="true"/>
> <copyField source="field1" dest="text"/>  
> ....
> <copyField source="field14" dest="text"/>
> {code}
> The "text_pl_hunspell" configuration:
> {code:xml}
> <fieldType name="text_pl_hunspell" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.HunspellStemFilterFactory" 
> dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>         <!--filter class="solr.KeywordMarkerFilterFactory" 
> protected="protwords_pl.txt"/-->
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" 
> synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.HunspellStemFilterFactory" 
> dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>         <filter class="solr.KeywordMarkerFilterFactory" 
> protected="dict/protwords_pl.txt"/>
>       </analyzer>
>     </fieldType>
> {code}
> I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files 
> stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty). These are 
> the same files I used in version 3.4. 
> For the Polish stemmer the only difference is in the text field definition:
> {code}
> <field name="text" type="text_pl" indexed="true" stored="false" 
> multiValued="true"/>
>     <fieldType name="text_pl" class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StempelPolishStemFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory" 
> protected="dict/protwords_pl.txt"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" 
> synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StempelPolishStemFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory" 
> protected="dict/protwords_pl.txt"/>
>       </analyzer>
>     </fieldType>
> {code}
> One document has 23 fields:
> - 14 text fields copied to one text field (above) that is only indexed
> - 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)
> The size of one document is 3-4 kB.
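
The docs/sec figures quoted in the issue description follow from the import times and the document count. The Python sketch below recomputes them as a sanity check; it uses only the numbers stated in the description, and the names `import_seconds`, `docs_per_second` and `hunspell_ratio` are illustrative, not anything taken from Solr.

```python
# Figures from the issue description: full import of 489017 documents.
TOTAL_DOCS = 489017

# (Solr version, filter factory) -> reported import time in seconds.
import_seconds = {
    ("3.4", "StempelPolishStemFilterFactory"): 2908,
    ("3.4", "HunspellStemFilterFactory"): 3922,
    ("4.0", "StempelPolishStemFilterFactory"): 3016,
    ("4.0", "HunspellStemFilterFactory"): 44580,
}


def docs_per_second(seconds: int, total: int = TOTAL_DOCS) -> float:
    """Average indexing throughput over the whole import."""
    return total / seconds


throughput = {key: docs_per_second(sec) for key, sec in import_seconds.items()}

for (version, factory), rate in throughput.items():
    print(f"Solr {version} {factory}: {rate:.0f} docs/sec")

# How much longer the Hunspell import takes in 4.0 than in 3.4.
hunspell_ratio = (
    import_seconds[("4.0", "HunspellStemFilterFactory")]
    / import_seconds[("3.4", "HunspellStemFilterFactory")]
)
# Duration of the slow import, in hours.
hunspell_hours = import_seconds[("4.0", "HunspellStemFilterFactory")] / 3600

print(f"Hunspell import time, 4.0 vs 3.4: {hunspell_ratio:.1f}x")
print(f"Hunspell import in 4.0: {hunspell_hours:.1f} hours")
```

By the same arithmetic the Stempel import time differs by under 4% between the two versions (2908 vs 3016 seconds), so in these figures the slowdown shows up only with the Hunspell filter.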

