[jira] [Updated] (SOLR-5921) WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)

JIRA Thu, 27 Mar 2014 01:54:15 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Malte Hübner updated SOLR-5921:
-------------------------------

    Description: 
WordDelimiterFilterFactory generates word parts and number parts even though 
all splitting options in the configuration are disabled.

*This is the fieldType setup from my schema:*

{code}
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            stemEnglishPossessive="0" generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt"
            ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
{code}

The given search term is: *X-002-99-495*

WordDelimiterFilterFactory indexes the following word parts:

* X-002-99-495
* X (shouldn't be there)
* 00299495 (shouldn't be there)
* X00299495

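The 'X' should not be indexed or queried as a separate term, and neither should 
'00299495'. As the schema shows, generateWordParts, generateNumberParts, 
splitOnCaseChange and splitOnNumerics are all set to 0.

The same happens when I move the alphabetic part to a different position in 
the search term.

Searching for *002-abc-99-495* gives me: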

* 002-abc-99-495 
* 002 (shouldn't be there)
* abc (shouldn't be there)
* 99495 (shouldn't be there)
* 002abc99495
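
The pattern in both token lists above can be written down as a small model: 
the original term, then one token per run of neighbouring parts of the same 
kind (digits or not), then all parts joined together. The Python sketch below 
only illustrates that pattern. It is not the WordDelimiterFilter 
implementation, the function name observed_tokens is made up for this example, 
and it assumes plain ASCII terms with non-alphanumeric delimiters.

```python
import re


def observed_tokens(term):
    """Toy model of the token lists reported above (not Lucene's algorithm)."""
    # Split on any run of non-alphanumeric characters, e.g. the hyphens.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", term) if p]

    # Group neighbouring parts of the same kind (all digits vs. anything else).
    runs = []
    for part in parts:
        is_number = part.isdigit()
        if runs and runs[-1][0] == is_number:
            runs[-1][1].append(part)
        else:
            runs.append((is_number, [part]))

    tokens = [term]                                   # the unchanged input term
    tokens += ["".join(group) for _, group in runs]   # one token per run
    tokens.append("".join(parts))                     # all parts joined

    # Drop duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(tokens))


print(observed_tokens("X-002-99-495"))
# ['X-002-99-495', 'X', '00299495', 'X00299495']
print(observed_tokens("002-abc-99-495"))
# ['002-abc-99-495', '002', 'abc', '99495', '002abc99495']
```

Both calls reproduce the lists given in this report, which suggests that the 
unexpected tokens are runs of same-type parts rather than arbitrary splits.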

Please see the attached screenshot.
This is not the behavior I expect from this configuration, so I believe it is a bug.







> WordDelimiterFilterFactory splits up hyphenated terms although 
> splitOnNumerics, generateWordParts and generateNumberParts are set to 0 
> (false)
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5921
>                 URL: https://issues.apache.org/jira/browse/SOLR-5921
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 4.7
>            Reporter: Malte Hübner
>             Fix For: 4.7.1
>
>         Attachments: 2014-03-27 09_50_33-Solr Admin.png


