Re: [xwiki-devs] [Solr] Word delimiter filter on English text

Marius Dumitru Florea Thu, 07 May 2015 06:50:42 -0700

On Tue, May 5, 2015 at 3:12 PM, Eduard Moraru <[email protected]> wrote:
> Hi,
>


> The question is about content fields (document content, textarea content,
> etc.) and not about the document's space name and document name fields,
> which will still match in both approaches, right?

The question is about the fields that are indexed depending on the
document locale.

>
> As far as I've understood it, text_en gets fewer matches than
> text_en_splitting, but text_en has better support for cases where in
> text_en_splitting you would have to use a phrase query to get the match
> (e.g. "Blog.News", "xwiki.com", etc.).

With text_en_splitting a search for "Blog.News" will also match "blog
news" because the phrase from the query is analyzed in the same way it
would have been indexed.
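
To make that concrete, here is a rough Python sketch. It only models the
splitting rules quoted from the schema comment (case change, letter/digit
boundaries, non-alphanumeric characters) plus lowercasing. It is not Solr's
WordDelimiterFilter, it skips stop words and stemming, and the function
names are made up for the illustration:

```python
import re

# Zero-width boundaries: lower->upper case change, letter<->digit change.
_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])"
    r"|(?<=[A-Za-z])(?=[0-9])"
    r"|(?<=[0-9])(?=[A-Za-z])"
)
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def split_token(token):
    """Split one whitespace-delimited token into lowercased word parts."""
    marked = _BOUNDARY.sub(" ", token)  # mark case/digit boundaries
    return [part.lower() for part in _NON_ALNUM.split(marked) if part]


def analyze(text):
    """Apply the same (simplified) analysis to indexed text and to queries."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_token(raw))
    return tokens


def phrase_matches(query, document):
    """True if the analyzed query occurs as consecutive tokens in the document."""
    q, d = analyze(query), analyze(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


print(analyze("Blog.News"))                              # ['blog', 'news']
print(analyze("WiFi"))                                   # ['wi', 'fi']
print(phrase_matches("Blog.News", "latest blog news"))   # True
```

Both sides go through the same analysis, so "Blog.News" and "blog news" end
up as the same token sequence, and the phrase matches either way.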

>
> IMO, text_en_splitting sounds more adapted to real life uses and to the
> fuzziness of user queries. If we want explicit matches for "xwiki.com" or
> "Blog.News" within a document's content, phrase queries can still be used,
> right? (i.e. quoting the explicit string).
>
> Thanks,
> Eduard
>
>
> On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea <
> [email protected]> wrote:
>
>> Hi guys,
>>
>> I just noticed (while updating the screenshots for the Solr Search UI
>> documentation [1]) that searching for "blog" doesn't match "Blog.News"
>> from the category of BlogIntroduction any more as indicated in [2].
>>
>> Debug mode view shows me that "Blog.News" is indexed as "blog.new"
>> which means the text is not split into "blog" and "news" as I would have
>> expected in this case.
>>
>> After checking the Solr schema configuration I understood that this is
>> normal considering that we use the Standard Tokenizer [3] for English
>> text which has this exception:
>>
>> "Periods (dots) that are not followed by whitespace are kept as part
>> of the token, including Internet domain names."
>>
>> Further investigation showed that before 6.0M1 we used the Word
>> Delimiter Filter [4] for English text but I changed this with
>> XWIKI-8911 when upgrading to Solr 4.7.0.
>>
>> I then noticed that the Solr schema has both text_en and
>> text_en_splitting types, the latter with this comment:
>>
>> A text field with defaults appropriate for English, plus aggressive
>> word-splitting and autophrase features enabled. This field is just
>> like text_en, except it adds WordDelimiterFilter to enable splitting
>> and matching of words on case-change, alpha numeric boundaries, and
>> non-alphanumeric chars. This means certain compound word cases will
>> work, for example query "wi fi" will match document "WiFi" or "wi-fi".
>>
>> So anyone who wants to use this type for English text instead needs to
>> change the type in:
>>
>> <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
>> multiValued="true" />
>>
>> The question is whether we should use this type by default or not. As
>> explained in the comment above, there are downsides.
>>
>> Thanks,
>> Marius
>>
>> [1]
>> http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
>> [2]
>> http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
>> [3]
>> https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
>> [4]
>> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
>> _______________________________________________
>> devs mailing list
>> [email protected]
>> http://lists.xwiki.org/mailman/listinfo/devs
>>
