Re: [xwiki-devs] [Solr] Word delimiter filter on English text
On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu ser...@xwiki.org wrote:

Well, my use case is not the same, since I'm indexing ontologies and the end purpose is to find the best matching terms. A few numbers, though:

- a 4MB ontology with 11k terms ends up as a 16MB index (including spellcheck, and most fields are also stored); searches take ~40ms including the XWiki overhead, ~10ms just in Solr
- a 180MB ontology with 24k terms ends up as a 100MB index, with ~15ms Solr search time

For smaller indexes it does seem to use more disk space than the source, but Lucene is good at indexing larger data sets, and after a while the index grows more slowly than the data. For me the extra disk space is worth it, since every user is amazed by how good the search is at finding the relevant terms, overcoming typos, synonyms, and abbreviations, plus autocomplete while typing.

Do you do this for multiple languages, or just for English? In other words, do you have text_fr_splitting, text_es_splitting, etc.?

Thanks Sergiu, I'll definitely take this into account.

Marius

In XWiki, not all fields should be indexed in all the ways, since it doesn't make sense to expect an exact match on a large textarea or on the document content.

On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:

Hi Sergiu,

Can you tell us the effect on the index size (and, in the end, on speed) if each field (e.g. the document title, or a String or TextArea property) is indexed in 5 different ways (5 separate fields in the index)? Is it worth having this configuration by default?

Thanks,
Marius

On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:

I agree with Paul.
The way I usually do searches is:

- each field gets indexed several times, including:
  -- exact matches ^5n (field == query)
  -- prefix matches ^1.5n (field ^= query)
  -- same spelling ^1.8n (query words in field)
  -- fuzzy matching ^n (aggressive tokenization and stemming)
  -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
  -- and three catch-all fields into which every other field gets copied, with spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: the page title and name have the highest boost, a catch-all field has the lowest
- search with edismax, with pf at double the boost (2n) on exact, prefix, spelling and fuzzy, and qf on spelling, fuzzy and stub

--
Sergiu Dumitriu
http://purl.org/net/sergiu/

___
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs
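Sergiu's edismax setup could be sketched as solrconfig.xml request-handler defaults. The field names (title_exact, title_spelling, etc.) and the title importance n = 10 are illustrative assumptions, not values from the thread:

```xml
<!-- Hypothetical handler defaults for one field with importance n = 10 -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- qf: free-term matching on the looser variants
         (spelling ^1.8n, fuzzy ^n, stub ^.5n) -->
    <str name="qf">title_spelling^18 title_fuzzy^10 title_stub^5</str>
    <!-- pf: phrase matching at double the boost (2n)
         on the stricter variants -->
    <str name="pf">title_exact^20 title_prefix^20 title_spelling^20 title_fuzzy^20</str>
  </lst>
</requestHandler>
```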
Re: [xwiki-devs] [Solr] Word delimiter filter on English text
Avoid storing everything, imperatively! (This was done earlier in the Lucene plugin and has been the main cause of its slowness.)

My general rule of thumb is that an index is about 10% of the size of a text file. I really would not be scared by indexing the text in 5 different fields.

paul

On 8/05/15 08:39, Sergiu Dumitriu wrote:

Well, my use case is not the same, since I'm indexing ontologies and the end purpose is to find the best matching terms. A few numbers, though:

- a 4MB ontology with 11k terms ends up as a 16MB index (including spellcheck, and most fields are also stored); searches take ~40ms including the XWiki overhead, ~10ms just in Solr
- a 180MB ontology with 24k terms ends up as a 100MB index, with ~15ms Solr search time

For smaller indexes it does seem to use more disk space than the source, but Lucene is good at indexing larger data sets, and after a while the index grows more slowly than the data. For me the extra disk space is worth it, since every user is amazed by how good the search is at finding the relevant terms, overcoming typos, synonyms, and abbreviations, plus autocomplete while typing.

In XWiki, not all fields should be indexed in all the ways, since it doesn't make sense to expect an exact match on a large textarea or on the document content.

On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:

Hi Sergiu,

Can you tell us the effect on the index size (and, in the end, on speed) if each field (e.g. the document title, or a String or TextArea property) is indexed in 5 different ways (5 separate fields in the index)? Is it worth having this configuration by default?

Thanks,
Marius

On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:

I agree with Paul.
The way I usually do searches is:

- each field gets indexed several times, including:
  -- exact matches ^5n (field == query)
  -- prefix matches ^1.5n (field ^= query)
  -- same spelling ^1.8n (query words in field)
  -- fuzzy matching ^n (aggressive tokenization and stemming)
  -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
  -- and three catch-all fields into which every other field gets copied, with spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: the page title and name have the highest boost, a catch-all field has the lowest
- search with edismax, with pf at double the boost (2n) on exact, prefix, spelling and fuzzy, and qf on spelling, fuzzy and stub
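The "index each field several times" approach, combined with Paul's advice not to store everything, might look roughly like this in schema.xml. All field and type names here are hypothetical sketches, not XWiki's actual schema; only "string" is a stock Solr type, the text_* types stand in for the variant analyzer chains:

```xml
<!-- One stored source field, plus unstored per-variant copies -->
<field name="title"          type="text_general"  indexed="true" stored="true"/>
<field name="title_exact"    type="string"        indexed="true" stored="false"/>
<field name="title_prefix"   type="text_prefix"   indexed="true" stored="false"/>
<field name="title_spelling" type="text_spelling" indexed="true" stored="false"/>
<field name="title_fuzzy"    type="text_fuzzy"    indexed="true" stored="false"/>
<field name="title_stub"     type="text_stub"     indexed="true" stored="false"/>

<!-- Feed every variant from the single stored source field -->
<copyField source="title" dest="title_exact"/>
<copyField source="title" dest="title_prefix"/>
<copyField source="title" dest="title_spelling"/>
<copyField source="title" dest="title_fuzzy"/>
<copyField source="title" dest="title_stub"/>
```

Keeping stored="false" on the variant copies means the extra fields cost only index terms, not duplicated stored content.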
Re: [xwiki-devs] [Solr] Word delimiter filter on English text
Well, my use case is not the same, since I'm indexing ontologies and the end purpose is to find the best matching terms. A few numbers, though:

- a 4MB ontology with 11k terms ends up as a 16MB index (including spellcheck, and most fields are also stored); searches take ~40ms including the XWiki overhead, ~10ms just in Solr
- a 180MB ontology with 24k terms ends up as a 100MB index, with ~15ms Solr search time

For smaller indexes it does seem to use more disk space than the source, but Lucene is good at indexing larger data sets, and after a while the index grows more slowly than the data. For me the extra disk space is worth it, since every user is amazed by how good the search is at finding the relevant terms, overcoming typos, synonyms, and abbreviations, plus autocomplete while typing.

In XWiki, not all fields should be indexed in all the ways, since it doesn't make sense to expect an exact match on a large textarea or on the document content.

On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:

Hi Sergiu,

Can you tell us the effect on the index size (and, in the end, on speed) if each field (e.g. the document title, or a String or TextArea property) is indexed in 5 different ways (5 separate fields in the index)? Is it worth having this configuration by default?

Thanks,
Marius

On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:

I agree with Paul.
The way I usually do searches is:

- each field gets indexed several times, including:
  -- exact matches ^5n (field == query)
  -- prefix matches ^1.5n (field ^= query)
  -- same spelling ^1.8n (query words in field)
  -- fuzzy matching ^n (aggressive tokenization and stemming)
  -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
  -- and three catch-all fields into which every other field gets copied, with spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: the page title and name have the highest boost, a catch-all field has the lowest
- search with edismax, with pf at double the boost (2n) on exact, prefix, spelling and fuzzy, and qf on spelling, fuzzy and stub

--
Sergiu Dumitriu
http://purl.org/net/sergiu/
Re: [xwiki-devs] [Solr] Word delimiter filter on English text
On 05/08/2015 05:22 AM, Marius Dumitru Florea wrote:

On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu ser...@xwiki.org wrote:

Well, my use case is not the same, since I'm indexing ontologies and the end purpose is to find the best matching terms. A few numbers, though:

- a 4MB ontology with 11k terms ends up as a 16MB index (including spellcheck, and most fields are also stored); searches take ~40ms including the XWiki overhead, ~10ms just in Solr
- a 180MB ontology with 24k terms ends up as a 100MB index, with ~15ms Solr search time

For smaller indexes it does seem to use more disk space than the source, but Lucene is good at indexing larger data sets, and after a while the index grows more slowly than the data. For me the extra disk space is worth it, since every user is amazed by how good the search is at finding the relevant terms, overcoming typos, synonyms, and abbreviations, plus autocomplete while typing.

Do you do this for multiple languages, or just for English? In other words, do you have text_fr_splitting, text_es_splitting, etc.?

At the moment, only English.

Thanks Sergiu, I'll definitely take this into account.

Marius

In XWiki, not all fields should be indexed in all the ways, since it doesn't make sense to expect an exact match on a large textarea or on the document content.

On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:

Hi Sergiu,

Can you tell us the effect on the index size (and, in the end, on speed) if each field (e.g. the document title, or a String or TextArea property) is indexed in 5 different ways (5 separate fields in the index)? Is it worth having this configuration by default?

Thanks,
Marius

On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu ser...@xwiki.org wrote:

I agree with Paul.
The way I usually do searches is:

- each field gets indexed several times, including:
  -- exact matches ^5n (field == query)
  -- prefix matches ^1.5n (field ^= query)
  -- same spelling ^1.8n (query words in field)
  -- fuzzy matching ^n (aggressive tokenization and stemming)
  -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
  -- and three catch-all fields into which every other field gets copied, with spelling, fuzzy and stub variants
- where n is a factor based on the field's importance: the page title and name have the highest boost, a catch-all field has the lowest
- search with edismax, with pf at double the boost (2n) on exact, prefix, spelling and fuzzy, and qf on spelling, fuzzy and stub

--
Sergiu Dumitriu
http://purl.org/net/sergiu
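For reference, the kind of "splitting" field type the thread's subject refers to can be sketched as an analyzer chain in the spirit of the stock text_en_splitting type shipped with Solr's example schema at the time; a text_fr_splitting or text_es_splitting variant would swap in the corresponding stopword list and stemmer:

```xml
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Whitespace tokenizer keeps punctuation attached, so the word
         delimiter filter decides where to split -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <!-- Split on case changes, digit/letter boundaries and punctuation,
         and also index the concatenated forms -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```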