On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu <[email protected]> wrote:
> Well, my use case is not the same, since I'm indexing ontologies and the
> end purpose is to find the best matching terms. A few numbers though:
>
> - a 4MB ontology with 11k terms ends up as a 16MB index (including
> spellcheck, and most fields are also stored); searches take ~40ms
> including the XWiki overhead, ~10ms just in Solr
> - a 180MB ontology with 24k terms -> 100MB index, ~15ms Solr search time
>
> For smaller indexes, it does seem to use more disk space than the
> source, but Lucene is good at indexing larger data sets, and after a
> while the index grows more slowly than the data.
>
> For me it is worth the extra disk space, since every user is amazed by
> how good the search is at finding the relevant terms, overcoming typos,
> synonyms, and abbreviations, plus autocomplete while typing.

Do you do this for multiple languages or just for English? In other
words, do you have text_fr_splitting, text_es_splitting, etc.?

Thanks Sergiu, I'll definitely take this into account.

Marius

> In XWiki, not all fields should be indexed in all the ways, since it
> doesn't make sense to expect an exact match on a large textarea or the
> document content.
>
> On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:
>> Hi Sergiu,
>>
>> Can you tell us the effect on the index size (and speed in the end) if
>> each field (e.g. document title, a String or TextArea property) is
>> indexed in 5 different ways (5 separate fields in the index)? Is it
>> worth having this configuration by default?
>>
>> Thanks,
>> Marius
>>
>> On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu <[email protected]> wrote:
>>> I agree with Paul.
>>>
>>> The way I usually do searches is:
>>>
>>> - each field gets indexed several times, including:
>>> -- exact matches ^5n (field == query)
>>> -- prefix matches ^1.5n (field ^= query)
>>> -- same spelling ^1.8n (query words in field)
>>> -- fuzzy matching ^n (aggressive tokenization and stemming)
>>> -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
>>> -- and three catch-all fields where every other field gets copied, with
>>> spelling, fuzzy and stub variants
>>> - where n is a factor based on the field's importance: page title and
>>> name have the highest boost, a catch-all field has the lowest boost
>>> - search with edismax, pf with double the boost (2n) on
>>> exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub
>>>
> --
> Sergiu Dumitriu
> http://purl.org/net/sergiu/
>
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs
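[Editor's note: a minimal sketch, not from the thread, of how the boosting scheme quoted above could be assembled into Solr eDisMax request parameters. The field names (title_exact, title_prefix, etc.) and the importance factor n are hypothetical; defType=edismax, qf, pf, and the field^boost syntax are standard Solr.]

```python
def edismax_params(query, n=10):
    """Build edismax parameters for one field family ("title"),
    following the per-field boosts described in the thread.

    Field names are illustrative; n is the field-importance factor.
    """
    # qf on the spelling, fuzzy, and stub variants, per the thread.
    qf = " ".join([
        f"title_spell^{1.8 * n}",   # same spelling ^1.8n
        f"title_fuzzy^{n}",         # fuzzy matching ^n
        f"title_stub^{0.5 * n}",    # stub matching ^.5n
    ])
    # pf (phrase boost) at double the boost (2n) on the
    # exact, prefix, spelling, and fuzzy variants.
    pf = " ".join([
        f"title_exact^{2 * 5 * n}",     # exact matches ^5n, doubled
        f"title_prefix^{2 * 1.5 * n}",  # prefix matches ^1.5n, doubled
        f"title_spell^{2 * 1.8 * n}",   # same spelling ^1.8n, doubled
        f"title_fuzzy^{2 * n}",         # fuzzy matching ^n, doubled
    ])
    return {"defType": "edismax", "q": query, "qf": qf, "pf": pf}

params = edismax_params("xwiki search", n=10)
```

These parameters would be sent with an ordinary Solr select request; a real setup would repeat the pattern for each field family (title, name, content, catch-all) with its own n.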

