Re: [xwiki-devs] [Solr] Word delimiter filter on English text

Marius Dumitru Florea Thu, 07 May 2015 07:04:05 -0700

Hi Sergiu,

Can you tell us the effect on the index size (and speed in the end) if
each field (e.g. document title, a String or TextArea property) is
indexed in 5 different ways (5 separate fields in the index)? Is it
worth having this configuration by default?


Thanks,
Marius

On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu <[email protected]> wrote:
> I agree with Paul.
>
> The way I usually do searches is:
>
> - each field gets indexed several times, including:
> -- exact matches ^5n (field == query)
> -- prefix matches ^1.5n (field ^= query)
> -- same spelling ^1.8n (query words in field)
> -- fuzzy matching ^n (aggressive tokenization and stemming)
> -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
> -- and three catch-all fields where every other field gets copied, with
> spelling, fuzzy and stub variants
> - where n is a factor based on the field's importance: page title and
> name have the highest boost, a catch-all field has the lowest boost
> - search with edismax, pf with double the boost (2n) on
> exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub
>
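A minimal sketch of how the boosts described above could be turned into edismax parameters. The field naming scheme (`title_exact`, `all_stub`, ...), the importance factors and the helper names are hypothetical, and "double the boost" is read here as twice each variant's own boost in pf; only the multipliers and the qf/pf split come from the description.

```python
from urllib.parse import urlencode

# Per-variant multipliers of the field importance n, as listed above.
VARIANT_BOOST = {"exact": 5.0, "prefix": 1.5, "spelling": 1.8, "fuzzy": 1.0, "stub": 0.5}
PF_VARIANTS = ("exact", "prefix", "spelling", "fuzzy")
QF_VARIANTS = ("spelling", "fuzzy", "stub")

# Hypothetical fields: name -> (importance n, variants that are indexed).
# The catch-all field only has the spelling, fuzzy and stub variants.
FIELDS = {
    "title": (4.0, tuple(VARIANT_BOOST)),
    "all": (1.0, ("spelling", "fuzzy", "stub")),
}


def boosted(fields, variants, scale=1.0):
    """Build a 'field^boost field^boost ...' string for the given variants."""
    terms = []
    for name, (n, available) in fields.items():
        for variant in variants:
            if variant in available:
                boost = VARIANT_BOOST[variant] * n * scale
                terms.append(f"{name}_{variant}^{boost:g}")
    return " ".join(terms)


def edismax_params(query, fields):
    """Assemble request parameters: qf for term matches, pf for phrase matches."""
    return {
        "defType": "edismax",
        "q": query,
        "qf": boosted(fields, QF_VARIANTS),
        "pf": boosted(fields, PF_VARIANTS, scale=2.0),
    }


params = edismax_params("blog news", FIELDS)
print(urlencode(params))
```

Each extra variant is one more indexed field per source field, which is where the index size question comes from.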
> On 05/05/2015 08:28 AM, Paul Libbrecht wrote:
>> Eddy,
>> We want both, don't we?
>> Doesn't the query use edismax?
>> If yes, we should make it search the field text_en with a higher weight than
>> text_en_splitting by setting the boost parameter to
>> text_en^2 text_en_splitting^1
>> Or?
>> Paul
>>
>>
>> -- fat fingered on my z10 --
>>   Original Message
>> From: Eduard Moraru
>> Sent: Tuesday, May 5, 2015 14:13
>> To: XWiki Developers
>> Reply-To: XWiki Developers
>> Subject: Re: [xwiki-devs] [Solr] Word delimiter filter on English text
>>
>> Hi,
>>
>> The question is about content fields (document content, textarea content,
>> etc.) and not about the document's space name and document name fields,
>> which will still match in both approaches, right?
>>
>> As far as I've understood it, text_en gets fewer matches than
>> text_en_splitting, but text_en has better support for cases where in
>> text_en_splitting you would have to use a phrase query to get the match
>> (e.g. "Blog.News", "xwiki.com", etc.).
>>
>> IMO, text_en_splitting sounds more adapted to real life uses and to the
>> fuzziness of user queries. If we want explicit matches for "xwiki.com" or
>> "Blog.News" within a document's content, phrase queries can still be used,
>> right? (i.e. quoting the explicit string).
>>
>> Thanks,
>> Eduard
>>
>>
>> On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea <
>> [email protected]> wrote:
>>
>>> Hi guys,
>>>
>>> I just noticed (while updating the screen shots for the Solr Search UI
>>> documentation [1]) that searching for "blog" doesn't match "Blog.News"
>>> from the category of BlogIntroduction any more as indicated in [2].
>>>
>>> Debug mode view shows me that "Blog.News" is indexed as "blog.new"
>>> which means the text is not split in "blog" and "news" as I would have
>>> expected in this case.
>>>
>>> After checking the Solr schema configuration I understood that this is
>>> normal considering that we use the Standard Tokenizer [3] for English
>>> text which has this exception:
>>>
>>> "Periods (dots) that are not followed by whitespace are kept as part
>>> of the token, including Internet domain names."
>>>
>>> Further investigation showed that before 6.0M1 we used the Word
>>> Delimiter Filter [4] for English text but I changed this with
>>> XWIKI-8911 when upgrading to Solr 4.7.0.
>>>
>>> I then noticed that the Solr schema has both text_en and
>>> text_en_splitting types, the latter with this comment:
>>>
>>> A text field with defaults appropriate for English, plus aggressive
>>> word-splitting and autophrase features enabled. This field is just
>>> like text_en, except it adds WordDelimiterFilter to enable splitting
>>> and matching of words on case-change, alpha numeric boundaries, and
>>> non-alphanumeric chars. This means certain compound word cases will
>>> work, for example query "wi fi" will match document "WiFi" or "wi-fi".
>>>
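A rough sketch of the difference between the two types, assuming two regular expressions as stand-ins for the analyzers. These are not the Lucene implementations and stemming is left out (so "news" is not reduced to "new"); the function names are hypothetical. The sketch only illustrates why "blog" stops matching "Blog.News" once the dot is kept inside the token.

```python
import re


def standard_like(text):
    """Approximate the quoted Standard Tokenizer rule: a dot that is
    followed by another letter or digit stays inside the token."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*", text)
    return [t.lower() for t in tokens]


def splitting_like(text):
    """Approximate word-delimiter splitting: break on non-alphanumeric
    characters, lower-to-upper case changes and letter/digit boundaries."""
    tokens = re.findall(r"[A-Z]*[a-z]+|[A-Z]+|[0-9]+", text)
    return [t.lower() for t in tokens]


for sample in ("Blog.News", "xwiki.com", "WiFi", "wi-fi"):
    print(sample, standard_like(sample), splitting_like(sample))

# A query term matches only if it appears as a token in the index.
print("blog" in standard_like("Category: Blog.News"))
print("blog" in splitting_like("Category: Blog.News"))
```

With the first function "Blog.News" stays a single token, so a search for "blog" finds nothing; with the second it becomes "blog" and "news", at the cost of needing a phrase query to match the exact string.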
>>> So in case someone wants to use this type instead for English text he
>>> needs to change the type in:
>>>
>>> <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
>>> multiValued="true" />
>>>
>>> The question is whether we should use this type by default or not. As
>>> explained in the comment above, there are downsides.
>>>
>>> Thanks,
>>> Marius
>>>
>>> [1]
>>> http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
>>> [2]
>>> http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
>>> [3]
>>> https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
>>> [4]
>>> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
>>> _______________________________________________
>>> devs mailing list
>>> [email protected]
>>> http://lists.xwiki.org/mailman/listinfo/devs
>>>
>>
>
>
> --
> Sergiu Dumitriu
> http://purl.org/net/sergiu/