Hi Sergiu,

Can you tell us the effect on the index size (and, in the end, on speed) if each field (e.g. document title, a String or TextArea property) is indexed in 5 different ways (5 separate fields in the index)? Is it worth having this configuration by default?

Thanks,
Marius

On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu <[email protected]> wrote:
> I agree with Paul.
>
> The way I usually do searches is:
>
> - each field gets indexed several times, including:
> -- exact matches ^5n (field == query)
> -- prefix matches ^1.5n (field ^= query)
> -- same spelling ^1.8n (query words in field)
> -- fuzzy matching ^n (aggressive tokenization and stemming)
> -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
> -- and three catch-all fields where every other field gets copied, with
> spelling, fuzzy and stub variants
> - where n is a factor based on the field's importance: page title and
> name have the highest boost, a catch-all field has the lowest boost
> - search with edismax, pf with double the boost (2n) on
> exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub
>
> On 05/05/2015 08:28 AM, Paul Libbrecht wrote:
>> Eddy,
>> We want both, or?
>> Does the query not use edismax?
>> If yes, we should make it search the field text_en with higher weight than
>> text_en_splitting by setting the boost parameter to
>> text_en^2 text_en_splitting^1
>> Or?
>> Paul
>>
>>
>> -- fat fingered on my z10 --
>> Original message
>> From: Eduard Moraru
>> Sent: Tuesday, May 5, 2015 14:13
>> To: XWiki Developers
>> Reply to: XWiki Developers
>> Subject: Re: [xwiki-devs] [Solr] Word delimiter filter on English text
>>
>> Hi,
>>
>> The question is about content fields (document content, textarea content,
>> etc.) and not about the document's space name and document name fields,
>> which will still match in both approaches, right?
>>
>> As far as I've understood it, text_en gets fewer matches than
>> text_en_splitting, but text_en has better support for cases where in
>> text_en_splitting you would have to use a phrase query to get the match
>> (e.g. "Blog.News", "xwiki.com", etc.).
>>
>> IMO, text_en_splitting sounds more adapted to real life uses and to the
>> fuzziness of user queries. If we want explicit matches for "xwiki.com" or
>> "Blog.News" within a document's content, phrase queries can still be used,
>> right? (i.e. quoting the explicit string).
>>
>> Thanks,
>> Eduard
>>
>>
>> On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea <
>> [email protected]> wrote:
>>
>>> Hi guys,
>>>
>>> I just noticed (while updating the screen shots for the Solr Search UI
>>> documentation [1]) that searching for "blog" doesn't match "Blog.News"
>>> from the category of BlogIntroduction any more as indicated in [2].
>>>
>>> Debug mode view shows me that "Blog.News" is indexed as "blog.new"
>>> which means the text is not split in "blog" and "news" as I would have
>>> expected in this case.
>>>
>>> After checking the Solr schema configuration I understood that this is
>>> normal considering that we use the Standard Tokenizer [3] for English
>>> text which has this exception:
>>>
>>> "Periods (dots) that are not followed by whitespace are kept as part
>>> of the token, including Internet domain names."
>>>
>>> Further investigation showed that before 6.0M1 we used the Word
>>> Delimiter Filter [4] for English text but I changed this with
>>> XWIKI-8911 when upgrading to Solr 4.7.0.
>>>
>>> I then noticed that the Solr schema has both text_en and
>>> text_en_splitting types, the latter with this comment:
>>>
>>> A text field with defaults appropriate for English, plus aggressive
>>> word-splitting and autophrase features enabled. This field is just
>>> like text_en, except it adds WordDelimiterFilter to enable splitting
>>> and matching of words on case-change, alpha numeric boundaries, and
>>> non-alphanumeric chars. This means certain compound word cases will
>>> work, for example query "wi fi" will match document "WiFi" or "wi-fi".
>>>
>>> So in case someone wants to use this type instead for English text he
>>> needs to change the type in:
>>>
>>> <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
>>> multiValued="true" />
>>>
>>> The question is whether we should use this type by default or not. As
>>> explained in the comment above, there are downsides.
>>>
>>> Thanks,
>>> Marius
>>>
>>> [1]
>>> http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
>>> [2]
>>> http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
>>> [3]
>>> https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
>>> [4]
>>> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
>>> _______________________________________________
>>> devs mailing list
>>> [email protected]
>>> http://lists.xwiki.org/mailman/listinfo/devs
>>>
>
>
> --
> Sergiu Dumitriu
> http://purl.org/net/sergiu/
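
For reference, here is a rough sketch of how the boost scheme from Sergiu's quoted message might translate into edismax request parameters. It is only an illustration, not the XWiki configuration: the field-name suffixes (`_exact`, `_prefix`, `_spelling`, `_fuzzy`, `_stub`) and the helper names are hypothetical, and reading "pf with double the boost (2n)" as "same multipliers, with n replaced by 2n" is an assumption. Only `defType`, `qf` and `pf` are actual Solr parameter names.

```python
# Multiplier applied to the importance factor n for each indexed variant,
# taken from the list in the quoted message.
VARIANT_MULTIPLIER = {
    "exact": 5.0,
    "prefix": 1.5,
    "spelling": 1.8,
    "fuzzy": 1.0,
    "stub": 0.5,
}

# Variants used by qf (term matching) and pf (phrase boosting),
# as listed in the quoted message.
QF_VARIANTS = ("spelling", "fuzzy", "stub")
PF_VARIANTS = ("exact", "prefix", "spelling", "fuzzy")


def boosted(field, variants, n):
    """Return 'field_variant^boost' entries for one field (suffixes are hypothetical)."""
    return " ".join(
        f"{field}_{variant}^{VARIANT_MULTIPLIER[variant] * n:g}"
        for variant in variants
    )


def edismax_params(importance):
    """Build edismax parameters from a {field name: importance factor n} mapping."""
    qf = " ".join(boosted(f, QF_VARIANTS, n) for f, n in importance.items())
    # Assumption: "double the boost" means using 2n in place of n for pf.
    pf = " ".join(boosted(f, PF_VARIANTS, 2 * n) for f, n in importance.items())
    return {"defType": "edismax", "qf": qf, "pf": pf}
```

For example, `edismax_params({"title": 2})` gives a `qf` of `title_spelling^3.6 title_fuzzy^2 title_stub^1`. Note that each logical field contributes five indexed variants here, which is the index-size cost the question at the top of this message is about.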

