I agree with Paul. The way I usually do searches is:
- each field gets indexed several times, including:
-- exact matches ^5n (field == query)
-- prefix matches ^1.5n (field ^= query)
-- same spelling ^1.8n (query words in field)
-- fuzzy matching ^n (aggressive tokenization and stemming)
-- stub matching ^0.5n (query tokens are prefixes of indexed tokens)
-- and three catch-all fields where every other field gets copied, with
   spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: page title and
  name have the highest boost, a catch-all field has the lowest boost
- search with edismax, pf with double the boost (2n) on
  exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub

On 05/05/2015 08:28 AM, Paul Libbrecht wrote:
> Eddy,
>
> We want both, don't we?
> Does the query not use edismax?
> If yes, we should make it search the field text_en with higher weight than
> text_en_splitting by setting the boost parameter to
> text_en^2 text_en_splitting^1
> Or?
>
> Paul
>
> -- fat fingered on my z10 --
> Original Message
> From: Eduard Moraru
> Sent: Tuesday, May 5, 2015 14:13
> To: XWiki Developers
> Reply-To: XWiki Developers
> Subject: Re: [xwiki-devs] [Solr] Word delimiter filter on English text
>
> Hi,
>
> The question is about content fields (document content, textarea content,
> etc.) and not about the document's space name and document name fields,
> which will still match in both approaches, right?
>
> As far as I've understood it, text_en gets fewer matches than
> text_en_splitting, but text_en has better support for cases where in
> text_en_splitting you would have to use a phrase query to get the match
> (e.g. "Blog.News", "xwiki.com", etc.).
>
> IMO, text_en_splitting sounds more adapted to real-life uses and to the
> fuzziness of user queries. If we want explicit matches for "xwiki.com" or
> "Blog.News" within a document's content, phrase queries can still be used,
> right? (i.e. quoting the explicit string).
>
> Thanks,
> Eduard
>
>
> On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea <
> mariusdumitru.flo...@xwiki.com> wrote:
>
>> Hi guys,
>>
>> I just noticed (while updating the screenshots for the Solr Search UI
>> documentation [1]) that searching for "blog" doesn't match "Blog.News"
>> from the category of BlogIntroduction any more as indicated in [2].
>>
>> Debug mode view shows me that "Blog.News" is indexed as "blog.new"
>> which means the text is not split in "blog" and "news" as I would have
>> expected in this case.
>>
>> After checking the Solr schema configuration I understood that this is
>> normal considering that we use the Standard Tokenizer [3] for English
>> text which has this exception:
>>
>> "Periods (dots) that are not followed by whitespace are kept as part
>> of the token, including Internet domain names."
>>
>> Further investigation showed that before 6.0M1 we used the Word
>> Delimiter Filter [4] for English text but I changed this with
>> XWIKI-8911 when upgrading to Solr 4.7.0.
>>
>> I then noticed that the Solr schema has both text_en and
>> text_en_splitting types, the latter with this comment:
>>
>> A text field with defaults appropriate for English, plus aggressive
>> word-splitting and autophrase features enabled. This field is just
>> like text_en, except it adds WordDelimiterFilter to enable splitting
>> and matching of words on case-change, alpha numeric boundaries, and
>> non-alphanumeric chars. This means certain compound word cases will
>> work, for example query "wi fi" will match document "WiFi" or "wi-fi".
>>
>> So in case someone wants to use this type instead for English text he
>> needs to change the type in:
>>
>> <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
>> multiValued="true" />
>>
>> The question is whether we should use this type by default or not. As
>> explained in the comment above, there are downsides.
>>
>> Thanks,
>> Marius
>>
>> [1]
>> http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
>> [2]
>> http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
>> [3]
>> https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
>> [4]
>> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
>> _______________________________________________
>> devs mailing list
>> devs@xwiki.org
>> http://lists.xwiki.org/mailman/listinfo/devs
>>
>

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/
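
A minimal sketch of the boost scheme described at the top of this message, to show how the qf and pf values could be generated rather than written by hand. Only the variant multipliers (5, 1.5, 1.8, 1, 0.5), the doubling for pf, and the edismax/qf/pf parameter names come from the thread. The field names, the "<field>_<variant>" naming convention, the importance factors n, and the single "all" catch-all field are hypothetical, and reading "double the boost (2n)" as "variant multiplier times 2n" is an assumption.

```python
# Variant multipliers taken from the list at the top of the message.
VARIANT_BOOST = {
    "exact": 5.0,
    "prefix": 1.5,
    "spelling": 1.8,
    "fuzzy": 1.0,
    "stub": 0.5,
}

# Variants used by each edismax parameter, as stated in the message.
QF_VARIANTS = ("spelling", "fuzzy", "stub")
PF_VARIANTS = ("exact", "prefix", "spelling", "fuzzy")

# Hypothetical importance factors n; "all" stands in for a catch-all field.
FIELD_IMPORTANCE = {"title": 4.0, "name": 3.0, "content": 1.0, "all": 0.5}

# Catch-all fields only exist in spelling, fuzzy and stub variants.
CATCH_ALL_FIELDS = {"all"}
CATCH_ALL_VARIANTS = {"spelling", "fuzzy", "stub"}


def boosted_fields(variants, scale=1.0):
    """Build a 'field^boost field^boost ...' string for qf or pf."""
    parts = []
    for field, n in FIELD_IMPORTANCE.items():
        for variant in variants:
            if field in CATCH_ALL_FIELDS and variant not in CATCH_ALL_VARIANTS:
                continue  # this variant is not indexed for catch-all fields
            boost = VARIANT_BOOST[variant] * n * scale
            # Hypothetical naming convention: <field>_<variant>
            parts.append(f"{field}_{variant}^{boost:g}")
    return " ".join(parts)


def edismax_params(query):
    """Request parameters for an edismax query using the scheme above."""
    return {
        "defType": "edismax",
        "q": query,
        "qf": boosted_fields(QF_VARIANTS),
        "pf": boosted_fields(PF_VARIANTS, scale=2.0),  # assumed meaning of 2n
    }


if __name__ == "__main__":
    for key, value in edismax_params("blog news").items():
        print(f"{key} = {value}")
```

Keeping the multipliers and importance factors in two small tables means a field's weight can be changed in one place without retyping every variant of it in qf and pf.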