I agree with Paul.

The way I usually do searches is:

- each field gets indexed several times, including:
-- exact matches ^5n (field == query)
-- prefix matches ^1.5n (field ^= query)
-- same spelling ^1.8n (query words in field)
-- fuzzy matching ^n (aggressive tokenization and stemming)
-- stub matching ^.5n (query tokens are prefixes of indexed tokens)
-- and three catch-all fields where every other field gets copied, with
spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: page title and
name have the highest boost, a catch-all field has the lowest boost
- search with edismax: pf uses the exact, prefix, spelling and fuzzy
variants with double the boost (2n), and qf uses the spelling, fuzzy and
stub variants (see the sketch below)
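
To make this concrete, here is a rough sketch of what that edismax setup can
look like in the /select handler of solrconfig.xml. The field names and
numbers below are invented for the example (assume n=2 for the title field
and n=0.2 for a catch-all field); the real values depend on the schema:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <!-- qf: spelling, fuzzy and stub variants, boosted 1.8n, n and 0.5n -->
      <str name="qf">
        title_spelling^3.6 title_fuzzy^2 title_stub^1
        catchall_spelling^0.36 catchall_fuzzy^0.2 catchall_stub^0.1
      </str>
      <!-- pf: exact, prefix, spelling and fuzzy variants at double boost (2n) -->
      <str name="pf">
        title_exact^4 title_prefix^4 title_spelling^4 title_fuzzy^4
        catchall_exact^0.4 catchall_prefix^0.4 catchall_spelling^0.4 catchall_fuzzy^0.4
      </str>
    </lst>
  </requestHandler>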

On 05/05/2015 08:28 AM, Paul Libbrecht wrote:
> Eddy,
> We want both or?
> Does the query not use edismax?
> If yes, we should make it search the field text_en with higher weight than 
> text_en_splitting by setting the boost parameter to
> text_en^2 text_en_splitting^1
> Or?
> Paul
> 
> 
> -- fat fingered on my z10 --
> Original Message
> From: Eduard Moraru
> Sent: Tuesday, May 5, 2015 14:13
> To: XWiki Developers
> Reply-To: XWiki Developers
> Subject: Re: [xwiki-devs] [Solr] Word delimiter filter on English text
> 
> Hi,
> 
> The question is about content fields (document content, textarea content,
> etc.) and not about the document's space name and document name fields,
> which will still match in both approaches, right?
> 
> As far as I've understood it, text_en gets fewer matches than
> text_en_splitting, but text_en has better support for cases where in
> text_en_splitting you would have to use a phrase query to get the match
> (e.g. "Blog.News", "xwiki.com", etc.).
> 
> IMO, text_en_splitting sounds better suited to real-life use and to the
> fuzziness of user queries. If we want explicit matches for "xwiki.com" or
> "Blog.News" within a document's content, phrase queries can still be used,
> right? (i.e. quoting the explicit string).
> 
> Thanks,
> Eduard
> 
> 
> On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea <
> mariusdumitru.flo...@xwiki.com> wrote:
> 
>> Hi guys,
>>
>> I just noticed (while updating the screen shots for the Solr Search UI
>> documentation [1]) that searching for "blog" doesn't match "Blog.News"
>> from the category of BlogIntroduction any more as indicated in [2].
>>
>> Debug mode view shows me that "Blog.News" is indexed as "blog.new"
>> which means the text is not split into "blog" and "news" as I would have
>> expected in this case.
>>
>> After checking the Solr schema configuration I understood that this is
>> normal considering that we use the Standard Tokenizer [3] for English
>> text which has this exception:
>>
>> "Periods (dots) that are not followed by whitespace are kept as part
>> of the token, including Internet domain names."
>>
>> Further investigation showed that before 6.0M1 we used the Word
>> Delimiter Filter [4] for English text but I changed this with
>> XWIKI-8911 when upgrading to Solr 4.7.0.
>>
>> I then noticed that the Solr schema has both text_en and
>> text_en_splitting types, the latter with this comment:
>>
>> A text field with defaults appropriate for English, plus aggressive
>> word-splitting and autophrase features enabled. This field is just
>> like text_en, except it adds WordDelimiterFilter to enable splitting
>> and matching of words on case-change, alpha numeric boundaries, and
>> non-alphanumeric chars. This means certain compound word cases will
>> work, for example query "wi fi" will match document "WiFi" or "wi-fi".
>>
>> So if someone wants to use this type for English text instead, they
>> need to change the type in:
>>
>> <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
>> multiValued="true" />
>>
>> The question is whether we should use this type by default or not. As
>> explained in the comment above, there are downsides.
>>
>> Thanks,
>> Marius
>>
>> [1]
>> http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
>> [2]
>> http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
>> [3]
>> https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
>> [4]
>> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
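
For reference, the text_en_splitting type Marius mentions is defined roughly
like this in the stock Solr example schema (paraphrased here; the schema.xml
shipped with XWiki is the authoritative version):

  <fieldType name="text_en_splitting" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="lang/stopwords_en.txt"/>
      <!-- splits on case change, alphanumeric boundaries and non-alphanumeric
           chars, so "Blog.News" is split into "Blog" and "News" before
           lowercasing and stemming -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="1" catenateWords="1" catenateNumbers="1"
              catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <!-- the query analyzer differs mainly in that catenateWords and
         catenateNumbers are turned off -->
  </fieldType>

and switching to it for English text is just the type change Marius pointed
out:

  <dynamicField name="*_en" type="text_en_splitting" indexed="true"
                stored="true" multiValued="true" />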


-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/
_______________________________________________
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs
