On 05/08/2015 05:22 AM, Marius Dumitru Florea wrote:
> On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu <ser...@xwiki.org> wrote:
>> Well, my usecase is not the same, since I'm indexing ontologies and the
>> end purpose is to find the best matching terms. A few numbers though:
>>
>> - 4MB ontology with 11k terms ends up as 16M index (including
>> spellcheck, and most fields are also stored), searches take ~40ms
>> including the XWiki overhead, ~10ms just in Solr
>> - 180MB ontology with 24k terms -> 100M index, ~15ms Solr search time
>>
>> For smaller indexes, it does seem to use more disk space than the
>> source, but Lucene is good at indexing larger data sets, and after a
>> while the index grows slower than the data.
>>
>
>> For me it is worth the extra disk space, since every user is amazed by
>> how good the search is at finding the relevant terms, overcoming typos,
>> synonyms, and abbreviations, plus autocomplete while typing.
>
> You do this for multiple languages or just for English? In other
> words, do you have text_fr_splitting, text_es_splitting etc.?

At the moment only English.

> Thanks Sergiu, I'll definitely take this into account.
> Marius
>
>>
>> In XWiki, not all fields should be indexed in all the ways, since it
>> doesn't make sense to expect an exact match on a large textarea or the
>> document content.
>>
>> On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:
>>> Hi Sergiu,
>>>
>>> Can you tell us the effect on the index size (and speed in the end) if
>>> each field (e.g. document title, a String or TextArea property) is
>>> indexed in 5 different ways (5 separate fields in the index)? It is
>>> worth having this configuration by default?
>>>
>>> Thanks,
>>> Marius
>>>
>>> On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu <ser...@xwiki.org> wrote:
>>>> I agree with Paul.
>>>>
>>>> The way I usually do searches is:
>>>>
>>>> - each field gets indexed several times, including:
>>>> -- exact matches ^5n (field == query)
>>>> -- prefix matches ^1.5n (field ^= query)
>>>> -- same spelling ^1.8n (query words in field)
>>>> -- fuzzy matching ^n (aggressive tokenization and stemming)
>>>> -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
>>>> -- and three catch-all fields where every other field gets copied, with
>>>> spelling, fuzzy and stub variants
>>>> - where n is a factor based on the field's importance: page title and
>>>> name have the highest boost, a catch-all field has the lowest boost
>>>> - search with edismax, pf with double the boost (2n) on
>>>> exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub
>>>>

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu
_______________________________________________
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs
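
For readers of the archive, the boost scheme quoted above (exact ^5n, prefix ^1.5n, spelling ^1.8n, fuzzy ^n, stub ^.5n, with pf at 2n) can be sketched as a small helper that builds the edismax `qf` and `pf` strings. This is only an illustration, not the configuration actually used: the `<field>_<variant>` naming, the example importance factors, and the reading of "double the boost" as 2 x multiplier x n are all assumptions. `defType`, `q`, `qf` and `pf` are the standard Solr request parameters.

```python
# Per-variant multipliers taken from the quoted message.
VARIANT_BOOST = {
    "exact": 5.0,
    "prefix": 1.5,
    "spelling": 1.8,
    "fuzzy": 1.0,
    "stub": 0.5,
}

# Variants used for phrase boosting (pf) and for term matching (qf),
# as listed in the quoted message.
PF_VARIANTS = ("exact", "prefix", "spelling", "fuzzy")
QF_VARIANTS = ("spelling", "fuzzy", "stub")


def boosted(fields, variants, factor=1.0):
    """Return a Solr field list such as 'title_fuzzy^10 title_stub^5'.

    `fields` maps a base field name to its importance factor n.
    The '<field>_<variant>' naming is hypothetical.
    """
    parts = []
    for name, n in fields.items():
        for variant in variants:
            boost = VARIANT_BOOST[variant] * n * factor
            parts.append(f"{name}_{variant}^{boost:g}")
    return " ".join(parts)


def edismax_params(query, fields):
    """Build request parameters for an edismax query.

    Assumption: 'double the boost (2n)' means each pf boost is
    2 * multiplier * n.
    """
    return {
        "defType": "edismax",
        "q": query,
        "qf": boosted(fields, QF_VARIANTS),
        "pf": boosted(fields, PF_VARIANTS, factor=2.0),
    }


if __name__ == "__main__":
    # Hypothetical importance factors: title high, catch-all low.
    params = edismax_params("search term", {"title": 10, "all": 1})
    for key, value in params.items():
        print(f"{key}={value}")
```

The resulting dictionary could be passed as query parameters to a Solr `/select` handler; the matching field types and copyField rules would still need to be defined in the schema.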