On Tue, May 5, 2015 at 3:12 PM, Eduard Moraru <[email protected]> wrote:
> Hi,
>
> The question is about content fields (document content, textarea content,
> etc.) and not about the document's space name and document name fields,
> which will still match in both approaches, right?

The question is about the fields that are indexed depending on the document locale.

> As far as I've understood it, text_en gets fewer matches than
> text_en_splitting, but text_en has better support for cases where in
> text_en_splitting you would have to use a phrase query to get the match
> (e.g. "Blog.News", "xwiki.com", etc.).

With text_en_splitting, a search for "Blog.News" will also match "blog news", because the phrase from the query is analyzed in the same way it would have been indexed.

> IMO, text_en_splitting sounds better suited to real-life use and to the
> fuzziness of user queries. If we want explicit matches for "xwiki.com" or
> "Blog.News" within a document's content, phrase queries can still be used,
> right? (i.e. quoting the explicit string).
>
> Thanks,
> Eduard
>
>
> On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea <
> [email protected]> wrote:
>
>> Hi guys,
>>
>> I just noticed (while updating the screenshots for the Solr Search UI
>> documentation [1]) that searching for "blog" doesn't match "Blog.News"
>> from the category of BlogIntroduction any more as indicated in [2].
>>
>> Debug mode view shows me that "Blog.News" is indexed as "blog.new"
>> which means the text is not split into "blog" and "news" as I would have
>> expected in this case.
>>
>> After checking the Solr schema configuration I understood that this is
>> normal considering that we use the Standard Tokenizer [3] for English
>> text which has this exception:
>>
>> "Periods (dots) that are not followed by whitespace are kept as part
>> of the token, including Internet domain names."
>>
>> Further investigation showed that before 6.0M1 we used the Word
>> Delimiter Filter [4] for English text but I changed this with
>> XWIKI-8911 when upgrading to Solr 4.7.0.
>>
>> I then noticed that the Solr schema has both text_en and
>> text_en_splitting types, the latter with this comment:
>>
>> A text field with defaults appropriate for English, plus aggressive
>> word-splitting and autophrase features enabled. This field is just
>> like text_en, except it adds WordDelimiterFilter to enable splitting
>> and matching of words on case-change, alpha numeric boundaries, and
>> non-alphanumeric chars. This means certain compound word cases will
>> work, for example query "wi fi" will match document "WiFi" or "wi-fi".
>>
>> So anyone who wants to use this type instead for English text
>> needs to change the type in:
>>
>> <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
>> multiValued="true" />
>>
>> The question is whether we should use this type by default or not. As
>> explained in the comment above, there are downsides.
>>
>> Thanks,
>> Marius
>>
>> [1] http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
>> [2] http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
>> [3] https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
>> [4] https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
>> _______________________________________________
>> devs mailing list
>> [email protected]
>> http://lists.xwiki.org/mailman/listinfo/devs
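
For anyone who wants to see the difference discussed in this thread without a running Solr instance, here is a rough Python sketch. It does not call Solr at all: `tokens_keep_dots`, `tokens_splitting` and `matches` are hypothetical helpers that only approximate the behaviour described above (dots not followed by whitespace stay inside the token, versus splitting on case changes, letter/digit boundaries and non-alphanumeric characters). Stemming is left out, so "news" stays "news" instead of becoming "new", and only ASCII letters are handled.

```python
import re

# Punctuation stripped from the edges of a token in the toy "keep dots" analyzer.
_EDGE_PUNCT = ".,;:!?\"'()"


def tokens_keep_dots(text):
    """Toy stand-in for the text_en behaviour described in the thread:
    split on whitespace, keep dots that sit inside a token, lowercase."""
    tokens = []
    for raw in text.split():
        token = raw.strip(_EDGE_PUNCT).lower()
        if token:
            tokens.append(token)
    return tokens


def tokens_splitting(text):
    """Toy stand-in for the text_en_splitting behaviour: split on
    non-alphanumeric characters, case changes and letter/digit boundaries."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", text)
    return [part.lower() for part in parts]


def matches(query, document, analyze):
    """True if every analyzed query token also appears in the analyzed
    document. Query and document go through the same analyzer."""
    doc_tokens = set(analyze(document))
    return all(token in doc_tokens for token in analyze(query))


if __name__ == "__main__":
    for analyze in (tokens_keep_dots, tokens_splitting):
        print(analyze.__name__, analyze("Blog.News"),
              matches("blog", "Blog.News", analyze))
```

With these toy analyzers, "blog" finds "Blog.News" only through `tokens_splitting`, and "wi fi" finds both "WiFi" and "wi-fi", which mirrors the trade-off raised in the thread. Real Solr analysis chains have more stages, so this is only an illustration of the idea.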

