Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-08 Thread Marius Dumitru Florea
On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu ser...@xwiki.org wrote:
 Well, my use case is not the same, since I'm indexing ontologies and the
 end purpose is to find the best matching terms. A few numbers though:

 - 4MB ontology with 11k terms ends up as 16M index (including
 spellcheck, and most fields are also stored), searches take ~40ms
 including the XWiki overhead, ~10ms just in Solr
 - 180MB ontology with 24k terms -> 100M index, ~15ms Solr search time

 For smaller indexes, it does seem to use more disk space than the
 source, but Lucene is good at indexing larger data sets, and after a
 while the index grows slower than the data.


 For me it is worth the extra disk space, since every user is amazed by
 how good the search is at finding the relevant terms, overcoming typos,
 synonyms, and abbreviations, plus autocomplete while typing.

Do you do this for multiple languages or just for English? In other
words, do you have text_fr_splitting, text_es_splitting etc.?

Thanks Sergiu, I'll definitely take this into account.
Marius


 In XWiki, not all fields should be indexed in all the ways, since it
 doesn't make sense to expect an exact match on a large textarea or the
 document content.

 On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:
 Hi Sergiu,

 Can you tell us the effect on the index size (and speed in the end) if
 each field (e.g. document title, a String or TextArea property) is
 indexed in 5 different ways (5 separate fields in the index)? Is it
 worth having this configuration by default?

 Thanks,
 Marius

 On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:
 I agree with Paul.

 The way I usually do searches is:

 - each field gets indexed several times, including:
 -- exact matches ^5n (field == query)
 -- prefix matches ^1.5n (field ^= query)
 -- same spelling ^1.8n (query words in field)
 -- fuzzy matching ^n (aggressive tokenization and stemming)
 -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
 -- and three catch-all fields where every other field gets copied, with
 spelling, fuzzy and stub variants
 - where n is a factor based on the field's importance: page title and
 name have the highest boost, a catch-all field has the lowest boost
 - search with edismax, pf with double the boost (2n) on
 exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub

 --
 Sergiu Dumitriu
 http://purl.org/net/sergiu/

___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs


Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-08 Thread Paul Libbrecht
Imperatively avoid storing everything!
(Storing everything was done earlier in the Lucene plugin and was the main
cause of its slowness.)

My general rule of thumb is that an index is about 10% of the size of a text file.
I really would not be scared by indexing the text in 5 different fields.
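
For illustration, a minimal sketch of that idea, assuming a hypothetical extra
*_en_splitting dynamic field used only for matching: it is indexed but not
stored, so it widens what can be matched without bloating the stored data.

<!-- hypothetical: a search-only copy of English text; indexed, never stored -->
<dynamicField name="*_en_splitting" type="text_en_splitting" indexed="true"
 stored="false" multiValued="true" />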

paul

On 8/05/15 08:39, Sergiu Dumitriu wrote:
 Well, my use case is not the same, since I'm indexing ontologies and the
 end purpose is to find the best matching terms. A few numbers though:

 - 4MB ontology with 11k terms ends up as 16M index (including
 spellcheck, and most fields are also stored), searches take ~40ms
 including the XWiki overhead, ~10ms just in Solr
 - 180MB ontology with 24k terms -> 100M index, ~15ms Solr search time

 For smaller indexes, it does seem to use more disk space than the
 source, but Lucene is good at indexing larger data sets, and after a
 while the index grows slower than the data.

 For me it is worth the extra disk space, since every user is amazed by
 how good the search is at finding the relevant terms, overcoming typos,
 synonyms, and abbreviations, plus autocomplete while typing.

 In XWiki, not all fields should be indexed in all the ways, since it
 doesn't make sense to expect an exact match on a large textarea or the
 document content.

 On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:
 Hi Sergiu,

 Can you tell us the effect on the index size (and speed in the end) if
 each field (e.g. document title, a String or TextArea property) is
 indexed in 5 different ways (5 separate fields in the index)? Is it
 worth having this configuration by default?

 Thanks,
 Marius

 On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:
 I agree with Paul.

 The way I usually do searches is:

 - each field gets indexed several times, including:
 -- exact matches ^5n (field == query)
 -- prefix matches ^1.5n (field ^= query)
 -- same spelling ^1.8n (query words in field)
 -- fuzzy matching ^n (aggressive tokenization and stemming)
 -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
 -- and three catch-all fields where every other field gets copied, with
 spelling, fuzzy and stub variants
 - where n is a factor based on the field's importance: page title and
 name have the highest boost, a catch-all field has the lowest boost
 - search with edismax, pf with double the boost (2n) on
 exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub





___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs


Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-08 Thread Sergiu Dumitriu
Well, my use case is not the same, since I'm indexing ontologies and the
end purpose is to find the best matching terms. A few numbers though:

- 4MB ontology with 11k terms ends up as 16M index (including
spellcheck, and most fields are also stored), searches take ~40ms
including the XWiki overhead, ~10ms just in Solr
- 180MB ontology with 24k terms -> 100M index, ~15ms Solr search time

For smaller indexes, it does seem to use more disk space than the
source, but Lucene is good at indexing larger data sets, and after a
while the index grows slower than the data.

For me it is worth the extra disk space, since every user is amazed by
how good the search is at finding the relevant terms, overcoming typos,
synonyms, and abbreviations, plus autocomplete while typing.

In XWiki, not all fields should be indexed in all the ways, since it
doesn't make sense to expect an exact match on a large textarea or the
document content.

On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:
 Hi Sergiu,
 
 Can you tell us the effect on the index size (and speed in the end) if
 each field (e.g. document title, a String or TextArea property) is
 indexed in 5 different ways (5 separate fields in the index)? Is it
 worth having this configuration by default?
 
 Thanks,
 Marius
 
 On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:
 I agree with Paul.

 The way I usually do searches is:

 - each field gets indexed several times, including:
 -- exact matches ^5n (field == query)
 -- prefix matches ^1.5n (field ^= query)
 -- same spelling ^1.8n (query words in field)
 -- fuzzy matching ^n (aggressive tokenization and stemming)
 -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
 -- and three catch-all fields where every other field gets copied, with
 spelling, fuzzy and stub variants
 - where n is a factor based on the field's importance: page title and
 name have the highest boost, a catch-all field has the lowest boost
 - search with edismax, pf with double the boost (2n) on
 exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/

___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs


Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-08 Thread Sergiu Dumitriu
On 05/08/2015 05:22 AM, Marius Dumitru Florea wrote:
 On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu ser...@xwiki.org wrote:
 Well, my use case is not the same, since I'm indexing ontologies and the
 end purpose is to find the best matching terms. A few numbers though:

 - 4MB ontology with 11k terms ends up as 16M index (including
 spellcheck, and most fields are also stored), searches take ~40ms
 including the XWiki overhead, ~10ms just in Solr
 - 180MB ontology with 24k terms -> 100M index, ~15ms Solr search time

 For smaller indexes, it does seem to use more disk space than the
 source, but Lucene is good at indexing larger data sets, and after a
 while the index grows slower than the data.

 
 For me it is worth the extra disk space, since every user is amazed by
 how good the search is at finding the relevant terms, overcoming typos,
 synonyms, and abbreviations, plus autocomplete while typing.
 
 Do you do this for multiple languages or just for English? In other
 words, do you have text_fr_splitting, text_es_splitting etc.?

At the moment only English.

 Thanks Sergiu, I'll definitely take this into account.
 Marius
 

 In XWiki, not all fields should be indexed in all the ways, since it
 doesn't make sense to expect an exact match on a large textarea or the
 document content.

 On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:
 Hi Sergiu,

 Can you tell us the effect on the index size (and speed in the end) if
 each field (e.g. document title, a String or TextArea property) is
 indexed in 5 different ways (5 separate fields in the index)? Is it
 worth having this configuration by default?

 Thanks,
 Marius

 On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:
 I agree with Paul.

 The way I usually do searches is:

 - each field gets indexed several times, including:
 -- exact matches ^5n (field == query)
 -- prefix matches ^1.5n (field ^= query)
 -- same spelling ^1.8n (query words in field)
 -- fuzzy matching ^n (aggressive tokenization and stemming)
 -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
 -- and three catch-all fields where every other field gets copied, with
 spelling, fuzzy and stub variants
 - where n is a factor based on the field's importance: page title and
 name have the highest boost, a catch-all field has the lowest boost
 - search with edismax, pf with double the boost (2n) on
 exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub


-- 
Sergiu Dumitriu
http://purl.org/net/sergiu
___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs


Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-07 Thread Marius Dumitru Florea
Hi Sergiu,

Can you tell us the effect on the index size (and speed in the end) if
each field (e.g. document title, a String or TextArea property) is
indexed in 5 different ways (5 separate fields in the index)? Is it
worth having this configuration by default?

Thanks,
Marius

On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:
 I agree with Paul.

 The way I usually do searches is:

 - each field gets indexed several times, including:
 -- exact matches ^5n (field == query)
 -- prefix matches ^1.5n (field ^= query)
 -- same spelling ^1.8n (query words in field)
 -- fuzzy matching ^n (aggressive tokenization and stemming)
 -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
 -- and three catch-all fields where every other field gets copied, with
 spelling, fuzzy and stub variants
 - where n is a factor based on the field's importance: page title and
 name have the highest boost, a catch-all field has the lowest boost
 - search with edismax, pf with double the boost (2n) on
 exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub

 On 05/05/2015 08:28 AM, Paul Libbrecht wrote:
 Eddy,
 We want both or?
 Does the query not use edismax?
 If yes, we should make it search the field text_en with higher weight than 
 text_en_splitting by setting the boost parameter to
 text_en^2 text_en_splitting^1
 Or?
 Paul


 -- fat fingered on my z10 --
   Original message
 From: Eduard Moraru
 Sent: Tuesday, 5 May 2015 14:13
 To: XWiki Developers
 Reply-To: XWiki Developers
 Subject: Re: [xwiki-devs] [Solr] Word delimiter filter on English text

 Hi,

 The question is about content fields (document content, textarea content,
 etc.) and not about the document's space name and document name fields,
 which will still match in both approaches, right?

 As far as I've understood it, text_en gets fewer matches than
 text_en_splitting, but text_en has better support for cases where in
 text_en_splitting you would have to use a phrase query to get the match
 (e.g. Blog.News, xwiki.com, etc.).

 IMO, text_en_splitting sounds more adapted to real life uses and to the
 fuzziness of user queries. If we want explicit matches for xwiki.com or
 Blog.News within a document's content, phrase queries can still be used,
 right? (i.e. quoting the explicit string).

 Thanks,
 Eduard


 On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea 
 mariusdumitru.flo...@xwiki.com wrote:

 Hi guys,

 I just noticed (while updating the screen shots for the Solr Search UI
 documentation [1]) that searching for blog doesn't match Blog.News
 from the category of BlogIntroduction any more as indicated in [2].

 Debug mode view shows me that Blog.News is indexed as blog.new
 which means the text is not split in blog and news as I would have
 expected in this case.

 After checking the Solr schema configuration I understood that this is
 normal considering that we use the Standard Tokenizer [3] for English
 text which has this exception:

 Periods (dots) that are not followed by whitespace are kept as part
 of the token, including Internet domain names.

 Further investigation showed that before 6.0M1 we used the Word
 Delimiter Filter [4] for English text but I changed this with
 XWIKI-8911 when upgrading to Solr 4.7.0.

 I then noticed that the Solr schema has both text_en and
 text_en_splitting types, the latter with this comment:

 A text field with defaults appropriate for English, plus aggressive
 word-splitting and autophrase features enabled. This field is just
 like text_en, except it adds WordDelimiterFilter to enable splitting
 and matching of words on case-change, alpha numeric boundaries, and
 non-alphanumeric chars. This means certain compound word cases will
 work, for example query wi fi will match document WiFi or wi-fi.

 So if someone wants to use this type for English text instead, they
 need to change the type in:

 <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
  multiValued="true" />

 The question is whether we should use this type by default or not. As
 explained in the comment above, there are downsides.

 Thanks,
 Marius

 [1]
 http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
 [2]
 http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
 [3]
 https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
 [4]
 https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
 ___
 devs mailing list
 devs@xwiki.org
 http://lists.xwiki.org/mailman/listinfo/devs



 --
 Sergiu Dumitriu
 http://purl.org/net/sergiu

Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-07 Thread Marius Dumitru Florea
On Tue, May 5, 2015 at 3:12 PM, Eduard Moraru enygma2...@gmail.com wrote:
 Hi,


 The question is about content fields (document content, textarea content,
 etc.) and not about the document's space name and document name fields,
 which will still match in both approaches, right?

The question is about the fields that are indexed depending on the
document locale.


 As far as I've understood it, text_en gets fewer matches than
 text_en_splitting, but text_en has better support for cases where in
 text_en_splitting you would have to use a phrase query to get the match
 (e.g. Blog.News, xwiki.com, etc.).

With text_en_splitting, a search for "Blog.News" will also match "blog
news", because the phrase from the query is analyzed in the same way the
content was indexed.
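
A rough sketch of the expected analysis (assuming the stock text_en and
text_en_splitting chains, so the token output below is approximate):

  Blog.News   via text_en            ->  blog.new     (kept as one token, then stemmed)
  Blog.News   via text_en_splitting  ->  blog | new   (split on the dot, lowercased, stemmed)
  "blog news" via text_en_splitting  ->  blog | new   (same chain applied at query time)

Because the query goes through the same analyzer, the split tokens line up
and the phrase matches.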


 IMO, text_en_splitting sounds more adapted to real life uses and to the
 fuzziness of user queries. If we want explicit matches for xwiki.com or
 Blog.News within a document's content, phrase queries can still be used,
 right? (i.e. quoting the explicit string).

 Thanks,
 Eduard


 On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea 
 mariusdumitru.flo...@xwiki.com wrote:

 Hi guys,

 I just noticed (while updating the screen shots for the Solr Search UI
 documentation [1]) that searching for blog doesn't match Blog.News
 from the category of BlogIntroduction any more as indicated in [2].

 Debug mode view shows me that Blog.News is indexed as blog.new
 which means the text is not split in blog and news as I would have
 expected in this case.

 After checking the Solr schema configuration I understood that this is
 normal considering that we use the Standard Tokenizer [3] for English
 text which has this exception:

 Periods (dots) that are not followed by whitespace are kept as part
 of the token, including Internet domain names.

 Further investigation showed that before 6.0M1 we used the Word
 Delimiter Filter [4] for English text but I changed this with
 XWIKI-8911 when upgrading to Solr 4.7.0.

 I then noticed that the Solr schema has both text_en and
 text_en_splitting types, the latter with this comment:

 A text field with defaults appropriate for English, plus aggressive
 word-splitting and autophrase features enabled. This field is just
 like text_en, except it adds WordDelimiterFilter to enable splitting
 and matching of words on case-change, alpha numeric boundaries, and
 non-alphanumeric chars. This means certain compound word cases will
 work, for example query wi fi will match document WiFi or wi-fi.

 So if someone wants to use this type for English text instead, they
 need to change the type in:

 <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
  multiValued="true" />

 The question is whether we should use this type by default or not. As
 explained in the comment above, there are downsides.

 Thanks,
 Marius

 [1]
 http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
 [2]
 http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
 [3]
 https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
 [4]
 https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs


Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-05 Thread Paul Libbrecht
Eddy,
We want both or?
Does the query not use edismax?
If yes, we should make it search the field text_en with higher weight than 
text_en_splitting by setting the boost parameter to
text_en^2 text_en_splitting^1
Or?
Paul
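
As a sketch of that idea, assuming the English content were indexed into two
separate fields, one per analyzer (say doccontent_en and doccontent_en_splitting,
both hypothetical names), the weighting could be expressed as an edismax
query-fields parameter:

qf=doccontent_en^2 doccontent_en_splitting^1

either passed with the request (defType=edismax) or set as a default on the
search request handler.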


-- fat fingered on my z10 --
  Original message
From: Eduard Moraru
Sent: Tuesday, 5 May 2015 14:13
To: XWiki Developers
Reply-To: XWiki Developers
Subject: Re: [xwiki-devs] [Solr] Word delimiter filter on English text

Hi,

The question is about content fields (document content, textarea content,
etc.) and not about the document's space name and document name fields,
which will still match in both approaches, right?

As far as I've understood it, text_en gets fewer matches than
text_en_splitting, but text_en has better support for cases where in
text_en_splitting you would have to use a phrase query to get the match
(e.g. Blog.News, xwiki.com, etc.).

IMO, text_en_splitting sounds more adapted to real life uses and to the
fuzziness of user queries. If we want explicit matches for xwiki.com or
Blog.News within a document's content, phrase queries can still be used,
right? (i.e. quoting the explicit string).

Thanks,
Eduard


On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea 
mariusdumitru.flo...@xwiki.com wrote:

 Hi guys,

 I just noticed (while updating the screen shots for the Solr Search UI
 documentation [1]) that searching for blog doesn't match Blog.News
 from the category of BlogIntroduction any more as indicated in [2].

 Debug mode view shows me that Blog.News is indexed as blog.new
 which means the text is not split in blog and news as I would have
 expected in this case.

 After checking the Solr schema configuration I understood that this is
 normal considering that we use the Standard Tokenizer [3] for English
 text which has this exception:

 Periods (dots) that are not followed by whitespace are kept as part
 of the token, including Internet domain names.

 Further investigation showed that before 6.0M1 we used the Word
 Delimiter Filter [4] for English text but I changed this with
 XWIKI-8911 when upgrading to Solr 4.7.0.

 I then noticed that the Solr schema has both text_en and
 text_en_splitting types, the latter with this comment:

 A text field with defaults appropriate for English, plus aggressive
 word-splitting and autophrase features enabled. This field is just
 like text_en, except it adds WordDelimiterFilter to enable splitting
 and matching of words on case-change, alpha numeric boundaries, and
 non-alphanumeric chars. This means certain compound word cases will
 work, for example query wi fi will match document WiFi or wi-fi.

 So if someone wants to use this type for English text instead, they
 need to change the type in:

 <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
  multiValued="true" />

 The question is whether we should use this type by default or not. As
 explained in the comment above, there are downsides.

 Thanks,
 Marius

 [1]
 http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
 [2]
 http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
 [3]
 https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
 [4]
 https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs


Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-05 Thread Sergiu Dumitriu
I agree with Paul.

The way I usually do searches is:

- each field gets indexed several times, including:
-- exact matches ^5n (field == query)
-- prefix matches ^1.5n (field ^= query)
-- same spelling ^1.8n (query words in field)
-- fuzzy matching ^n (aggressive tokenization and stemming)
-- stub matching ^.5n (query tokens are prefixes of indexed tokens)
-- and three catch-all fields where every other field gets copied, with
spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: page title and
name have the highest boost, a catch-all field has the lowest boost
- search with edismax, pf with double the boost (2n) on
exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub
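
For illustration, a minimal sketch of how such a setup could be wired as
edismax defaults in solrconfig.xml. The field names (title_exact, title_prefix,
title_spell, title_fuzzy, title_stub and the catchall_* variants) are
hypothetical placeholders for the per-field copies described above, with n=10
for the title, n=1 for the catch-all, and "double the boost" read as 2x the
qf boost:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- qf: spelling, fuzzy and stub variants, weighted by field importance (n) -->
    <str name="qf">title_spell^18 title_fuzzy^10 title_stub^5
                   catchall_spell^1.8 catchall_fuzzy^1 catchall_stub^0.5</str>
    <!-- pf: phrase boost (2n) on the exact, prefix, spelling and fuzzy variants -->
    <str name="pf">title_exact^100 title_prefix^30 title_spell^36 title_fuzzy^20</str>
  </lst>
</requestHandler>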

On 05/05/2015 08:28 AM, Paul Libbrecht wrote:
 Eddy,
 We want both or?
 Does the query not use edismax?
 If yes, we should make it search the field text_en with higher weight than 
 text_en_splitting by setting the boost parameter to
 text_en^2 text_en_splitting^1
 Or?
 Paul
 
 
 -- fat fingered on my z10 --
   Original message
 From: Eduard Moraru
 Sent: Tuesday, 5 May 2015 14:13
 To: XWiki Developers
 Reply-To: XWiki Developers
 Subject: Re: [xwiki-devs] [Solr] Word delimiter filter on English text
 
 Hi,
 
 The question is about content fields (document content, textarea content,
 etc.) and not about the document's space name and document name fields,
 which will still match in both approaches, right?
 
 As far as I've understood it, text_en gets fewer matches than
 text_en_splitting, but text_en has better support for cases where in
 text_en_splitting you would have to use a phrase query to get the match
 (e.g. Blog.News, xwiki.com, etc.).
 
 IMO, text_en_splitting sounds more adapted to real life uses and to the
 fuzziness of user queries. If we want explicit matches for xwiki.com or
 Blog.News within a document's content, phrase queries can still be used,
 right? (i.e. quoting the explicit string).
 
 Thanks,
 Eduard
 
 
 On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea 
 mariusdumitru.flo...@xwiki.com wrote:
 
 Hi guys,

 I just noticed (while updating the screen shots for the Solr Search UI
 documentation [1]) that searching for blog doesn't match Blog.News
 from the category of BlogIntroduction any more as indicated in [2].

 Debug mode view shows me that Blog.News is indexed as blog.new
 which means the text is not split in blog and news as I would have
 expected in this case.

 After checking the Solr schema configuration I understood that this is
 normal considering that we use the Standard Tokenizer [3] for English
 text which has this exception:

 Periods (dots) that are not followed by whitespace are kept as part
 of the token, including Internet domain names.

 Further investigation showed that before 6.0M1 we used the Word
 Delimiter Filter [4] for English text but I changed this with
 XWIKI-8911 when upgrading to Solr 4.7.0.

 I then noticed that the Solr schema has both text_en and
 text_en_splitting types, the latter with this comment:

 A text field with defaults appropriate for English, plus aggressive
 word-splitting and autophrase features enabled. This field is just
 like text_en, except it adds WordDelimiterFilter to enable splitting
 and matching of words on case-change, alpha numeric boundaries, and
 non-alphanumeric chars. This means certain compound word cases will
 work, for example query wi fi will match document WiFi or wi-fi.

 So if someone wants to use this type for English text instead, they
 need to change the type in:

 <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
  multiValued="true" />

 The question is whether we should use this type by default or not. As
 explained in the comment above, there are downsides.

 Thanks,
 Marius

 [1]
 http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
 [2]
 http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
 [3]
 https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
 [4]
 https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
 


-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/
___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs


Re: [xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-05 Thread Eduard Moraru
Hi,

The question is about content fields (document content, textarea content,
etc.) and not about the document's space name and document name fields,
which will still match in both approaches, right?

As far as I've understood it, text_en gets fewer matches than
text_en_splitting, but text_en has better support for cases where in
text_en_splitting you would have to use a phrase query to get the match
(e.g. Blog.News, xwiki.com, etc.).

IMO, text_en_splitting sounds more adapted to real life uses and to the
fuzziness of user queries. If we want explicit matches for xwiki.com or
Blog.News within a document's content, phrase queries can still be used,
right? (i.e. quoting the explicit string).
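
For illustration, the difference would roughly look like this (a sketch, with
the splitting analyzer in place):

  q=blog news      matches documents containing both terms anywhere
  q="Blog.News"    phrase query: the split tokens must appear adjacent, in order

so aggressive splitting at index time still leaves quoting as a way to ask for
the exact compound.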

Thanks,
Eduard


On Tue, May 5, 2015 at 12:55 PM, Marius Dumitru Florea 
mariusdumitru.flo...@xwiki.com wrote:

 Hi guys,

 I just noticed (while updating the screen shots for the Solr Search UI
 documentation [1]) that searching for blog doesn't match Blog.News
 from the category of BlogIntroduction any more as indicated in [2].

 Debug mode view shows me that Blog.News is indexed as blog.new
 which means the text is not split in blog and news as I would have
 expected in this case.

 After checking the Solr schema configuration I understood that this is
 normal considering that we use the Standard Tokenizer [3] for English
 text which has this exception:

 Periods (dots) that are not followed by whitespace are kept as part
 of the token, including Internet domain names.

 Further investigation showed that before 6.0M1 we used the Word
 Delimiter Filter [4] for English text but I changed this with
 XWIKI-8911 when upgrading to Solr 4.7.0.

 I then noticed that the Solr schema has both text_en and
 text_en_splitting types, the latter with this comment:

 A text field with defaults appropriate for English, plus aggressive
 word-splitting and autophrase features enabled. This field is just
 like text_en, except it adds WordDelimiterFilter to enable splitting
 and matching of words on case-change, alpha numeric boundaries, and
 non-alphanumeric chars. This means certain compound word cases will
 work, for example query wi fi will match document WiFi or wi-fi.

 So if someone wants to use this type for English text instead, they
 need to change the type in:

 <dynamicField name="*_en" type="text_en" indexed="true" stored="true"
  multiValued="true" />

 The question is whether we should use this type by default or not. As
 explained in the comment above, there are downsides.

 Thanks,
 Marius

 [1]
 http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
 [2]
 http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
 [3]
 https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
 [4]
 https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter

___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs


[xwiki-devs] [Solr] Word delimiter filter on English text

2015-05-05 Thread Marius Dumitru Florea
Hi guys,

I just noticed (while updating the screen shots for the Solr Search UI
documentation [1]) that searching for blog doesn't match Blog.News
from the category of BlogIntroduction any more as indicated in [2].

Debug mode view shows me that Blog.News is indexed as blog.new
which means the text is not split into blog and news as I would have
expected in this case.

After checking the Solr schema configuration I understood that this is
normal considering that we use the Standard Tokenizer [3] for English
text which has this exception:

Periods (dots) that are not followed by whitespace are kept as part
of the token, including Internet domain names.

Further investigation showed that before 6.0M1 we used the Word
Delimiter Filter [4] for English text but I changed this with
XWIKI-8911 when upgrading to Solr 4.7.0.

I then noticed that the Solr schema has both text_en and
text_en_splitting types, the latter with this comment:

A text field with defaults appropriate for English, plus aggressive
word-splitting and autophrase features enabled. This field is just
like text_en, except it adds WordDelimiterFilter to enable splitting
and matching of words on case-change, alpha numeric boundaries, and
non-alphanumeric chars. This means certain compound word cases will
work, for example query wi fi will match document WiFi or wi-fi.

So if someone wants to use this type for English text instead, they need
to change the type in:

<dynamicField name="*_en" type="text_en" indexed="true" stored="true"
 multiValued="true" />
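
For illustration, switching would only mean changing the type attribute, e.g.:

<!-- hypothetical change: same dynamic field, analyzed with the splitting type -->
<dynamicField name="*_en" type="text_en_splitting" indexed="true" stored="true"
 multiValued="true" />

and the text_en_splitting type itself is defined in the stock Solr example
schema roughly as follows (abridged sketch, index-time analyzer only):

<fieldType name="text_en_splitting" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <!-- the filter this thread is about: splits on case changes, digits and punctuation -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>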

The question is whether we should use this type by default or not. As
explained in the comment above, there are downsides.

Thanks,
Marius

[1] http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application
[2] 
http://extensions.xwiki.org/xwiki/bin/download/Extension/Solr+Search+Application/searchHighlighting.png
[3] 
https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer
[4] 
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs