Re: [xwiki-devs] [Solr] Word delimiter filter on English text
Eddy,

We want both, don't we? Doesn't the query use edismax? If so, we should make it search the field text_en with a higher weight than text_en_splitting, by setting the boost parameter to text_en^2 text_en_splitting^1. Or?

Paul
-- fat fingered on my z10

-- Original message --
From: Eduard Moraru
Sent: Tuesday, 5 May 2015, 14:13
To: XWiki Developers
Reply-To: XWiki Developers
Subject: Re: [xwiki-devs] [Solr] Word delimiter filter on English text

[...]

___ devs mailing list devs@xwiki.org http://lists.xwiki.org/mailman/listinfo/devs
[xwiki-devs] DocumentAccessBridge throws java.lang.Exception
Dear XWiki devs,

I recently wanted to refactor the usage of XWikiContext and XWikiDocument in my code to use DocumentAccessBridge and DocumentModelBridge. I was surprised, however, that many methods in DocumentAccessBridge throw java.lang.Exception. This is generally considered bad practice, since it forces all clients to catch java.lang.Exception and therefore also unchecked exceptions, from which the client potentially can't or shouldn't recover. For in-depth arguments see e.g. Joshua Bloch's Effective Java, which recommends using checked exceptions for recoverable conditions and runtime exceptions for programming errors, throwing exceptions appropriate to the abstraction, and declaring each checked exception individually.

Can somebody explain to me the reasoning behind throwing java.lang.Exception in this interface? Are there any plans to deprecate it and replace it with an interface without this caveat?

Thanks,
Marc
synventis GmbH
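To illustrate the point, here is a minimal, self-contained sketch. The interface and method names below are hypothetical, not the actual DocumentAccessBridge API: a method declared "throws Exception" forces callers into a catch-all that also swallows unchecked exceptions (programming errors), whereas a dedicated checked exception type would not.

```java
// Hypothetical sketch; not the real XWiki DocumentAccessBridge API.
public class ExceptionDemo {
    // Bad: declaring "throws Exception" forces callers to catch
    // everything, unchecked exceptions included.
    interface BroadBridge {
        String getDocumentContent(String reference) throws Exception;
    }

    // Better: a dedicated checked exception for recoverable failures only.
    static class BridgeException extends Exception {
        BridgeException(String message, Throwable cause) { super(message, cause); }
    }

    interface NarrowBridge {
        String getDocumentContent(String reference) throws BridgeException;
    }

    // Returns the simple name of whatever the broad interface makes us catch.
    static String describeCatch() {
        BroadBridge broad = ref -> {
            // A programming error, i.e. an unchecked exception.
            throw new IllegalArgumentException("bug: " + ref);
        };
        try {
            broad.getDocumentContent("Main.WebHome");
            return "no exception";
        } catch (Exception e) {
            // The catch-all swallows the programming error together with
            // genuinely recoverable failures; the caller cannot tell them apart.
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("caught: " + describeCatch());
    }
}
```

With NarrowBridge, the IllegalArgumentException would propagate as the bug it is, and only BridgeException would need handling.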
Re: [xwiki-devs] [Solr] Word delimiter filter on English text
I agree with Paul. The way I usually do searches is:

- each field gets indexed several times, including:
  -- exact matches ^5n (field == query)
  -- prefix matches ^1.5n (field ^= query)
  -- same spelling ^1.8n (query words in field)
  -- fuzzy matching ^n (aggressive tokenization and stemming)
  -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
  -- and three catch-all fields where every other field gets copied, with spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: page title and name have the highest boost, a catch-all field has the lowest boost
- search with edismax: pf with double the boost (2n) on exact, prefix, spelling and fuzzy, and qf on spelling, fuzzy and stub

On 05/05/2015 08:28 AM, Paul Libbrecht wrote:
[...]

-- Sergiu Dumitriu http://purl.org/net/sergiu/
Re: [xwiki-devs] [Solr] Word delimiter filter on English text
Hi,

The question is about content fields (document content, textarea content, etc.) and not about the document's space name and document name fields, which will still match in both approaches, right?

As far as I've understood it, text_en gets fewer matches than text_en_splitting, but text_en has better support for cases where in text_en_splitting you would have to use a phrase query to get the match (e.g. Blog.News, xwiki.com, etc.). IMO, text_en_splitting sounds more adapted to real-life uses and to the fuzziness of user queries. If we want explicit matches for xwiki.com or Blog.News within a document's content, phrase queries can still be used, right? (i.e. quoting the explicit string).

Thanks,
Eduard

On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea mariusdumitru.flo...@xwiki.com wrote:
[...]
Re: [xwiki-devs] [XWiki Day] BFD#87
Results: http://www.xwiki.org/xwiki/bin/view/Blog/Bug+Fixing+Day+87

Thanks,
Eduard

On Thu, Apr 30, 2015 at 12:10 PM, Eduard Moraru enygma2...@gmail.com wrote:

Hi devs,

Today is BFD#87: http://dev.xwiki.org/xwiki/bin/view/Community/XWikiDays#HBugfixingdays

Our current status for the 1 year period is 66 bugs behind. See: http://jira.xwiki.org/secure/Dashboard.jspa?selectPageId=1#Created-vs-Resolved-Chart/10470

Here's the BFD#86 dashboard to follow the progress during the day: http://jira.xwiki.org/secure/Dashboard.jspa?selectPageId=13198

Thanks and enjoy your bug fixing day!
-Eduard
[xwiki-devs] [Solr] Word delimiter filter on English text
Hi guys,

I just noticed (while updating the screenshots for the Solr Search UI documentation [1]) that searching for "blog" doesn't match "Blog.News" from the category of BlogIntroduction any more, as shown in [2]. The debug mode view shows me that "Blog.News" is indexed as "blog.new", which means the text is not split into "blog" and "news" as I would have expected in this case. After checking the Solr schema configuration I understood that this is normal, considering that we use the Standard Tokenizer [3] for English text, which has this exception: "Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names."

Further investigation showed that before 6.0M1 we used the Word Delimiter Filter [4] for English text, but I changed this with XWIKI-8911 when upgrading to Solr 4.7.0. I then noticed that the Solr schema has both text_en and text_en_splitting types, the latter with this comment: "A text field with defaults appropriate for English, plus aggressive word-splitting and autophrase features enabled. This field is just like text_en, except it adds WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars. This means certain compound word cases will work, for example query wi fi will match document WiFi or wi-fi."

So if someone wants to use this type for English text instead, they need to change the type in:

<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true" />

The question is whether we should use this type by default or not. As explained in the comment above, there are downsides.

Thanks,
Marius

[1] http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
[2] http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
[3] https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
[4] https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
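To make the difference concrete, here is a rough, self-contained sketch of the two behaviors described in the thread. This is not Solr code (real analyzers are chained in the schema, and stemming would further turn "news" into "new"); it only mimics the period-handling rule of the StandardTokenizer quoted above and the case-change/non-alphanumeric splitting of the WordDelimiterFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizerSketch {
    // Roughly mimics the StandardTokenizer exception quoted above:
    // periods not followed by whitespace stay inside the token.
    static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String t = raw.replaceAll("[^\\w.]", "").replaceAll("\\.$", "");
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Roughly mimics WordDelimiterFilter: split on non-alphanumeric
    // characters and on lowercase-to-uppercase case changes.
    static List<String> delimiterSplitting(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String withBreaks = raw.replaceAll("(?<=[a-z])(?=[A-Z])", " ");
            for (String part : withBreaks.split("[^A-Za-z0-9]+")) {
                if (!part.isEmpty()) tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(standardLike("Blog.News"));       // [blog.news]
        System.out.println(delimiterSplitting("Blog.News")); // [blog, news]
        System.out.println(delimiterSplitting("WiFi"));      // [wi, fi]
        System.out.println(delimiterSplitting("wi-fi"));     // [wi, fi]
    }
}
```

Under the first behavior a query for "blog" cannot match the single token "blog.news"; under the second, "Blog.News", "WiFi" and "wi-fi" all decompose into tokens that plain word queries can hit.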