[ https://issues.apache.org/jira/browse/SOLR-5921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Malte Hübner updated SOLR-5921:
-------------------------------
Description:
WordDelimiterFilterFactory generates word parts although splitting
configuration is deactivated.
This is the fieldType setup:
{code}
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            stemEnglishPossessive="0" generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}
The given search term is: *X-002-99-495*
WordDelimiterFilterFactory indexes the following word parts:
* X-002-99-495
* X (shouldn't be there)
* 00299495 (shouldn't be there)
* X00299495
Neither 'X' nor '00299495' should be indexed or queried as a separate term:
generateWordParts, generateNumberParts, splitOnCaseChange and splitOnNumerics
are all set to 0 in the schema.
I can move the character part around in the search term.
Searching for *002-abc-99-495* gives me:
* 002-abc-99-495
* 002 (shouldn't be there)
* abc (shouldn't be there)
* 99495 (shouldn't be there)
* 002abc99495
Please have a look at the attached screenshot (2014-03-27 09_50_33-Solr Admin.png).
This is not what I expect from the configuration! I think this must be a bug.
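To narrow this down, a reduced field type may help. The following is only a sketch: it keeps the tokenizer and the WordDelimiterFilterFactory with the index-time flags from the setup above and drops the stop, synonym and lowercase filters, so those can be ruled out as the source of the extra tokens. The name text_wdf_repro is hypothetical and not part of the reported schema.

```xml
<!-- Hypothetical reduced field type for reproduction; the name is an assumption. -->
<fieldType name="text_wdf_repro" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain is used for indexing and querying. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same flags as the index-time analyzer of text_de. -->
    <filter class="solr.WordDelimiterFilterFactory"
            stemEnglishPossessive="0" generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

Running X-002-99-495 through such a type on the Analysis screen should show whether the unexpected tokens come from WordDelimiterFilterFactory alone.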
> WordDelimiterFilterFactory splits up hyphenated terms although
> splitOnNumerics, generateWordParts and generateNumberParts are set to 0
> (false)
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-5921
> URL: https://issues.apache.org/jira/browse/SOLR-5921
> Project: Solr
> Issue Type: Bug
> Components: Schema and Analysis
> Affects Versions: 4.7
> Reporter: Malte Hübner
> Fix For: 4.7.1
>
> Attachments: 2014-03-27 09_50_33-Solr Admin.png
>
>