On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu <[email protected]> wrote:
> Well, my use case is not the same, since I'm indexing ontologies and the
> end purpose is to find the best matching terms. A few numbers though:
>
> - 4MB ontology with 11k terms ends up as 16M index (including
> spellcheck, and most fields are also stored), searches take ~40ms
> including the XWiki overhead, ~10ms just in Solr
> - 180MB ontology with 24k terms -> 100M index, ~15ms Solr search time
>
> For smaller indexes, it does seem to use more disk space than the
> source, but Lucene is good at indexing larger data sets, and after a
> while the index grows slower than the data.
>

> For me it is worth the extra disk space, since every user is amazed by
> how good the search is at finding the relevant terms, overcoming typos,
> synonyms, and abbreviations, plus autocomplete while typing.

Do you do this for multiple languages or just for English? In other
words, do you have text_fr_splitting, text_es_splitting, etc.?

Thanks Sergiu, I'll definitely take this into account.
Marius

>
> In XWiki, not all fields should be indexed in all the ways, since it
> doesn't make sense to expect an exact match on a large textarea or the
> document content.
>
> On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:
>> Hi Sergiu,
>>
>> Can you tell us the effect on the index size (and speed in the end) if
>> each field (e.g. document title, a String or TextArea property) is
>> indexed in 5 different ways (5 separate fields in the index)? Is it
>> worth having this configuration by default?
>>
>> Thanks,
>> Marius
>>
>> On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu <[email protected]> wrote:
>>> I agree with Paul.
>>>
>>> The way I usually do searches is:
>>>
>>> - each field gets indexed several times, including:
>>> -- exact matches ^5n (field == query)
>>> -- prefix matches ^1.5n (field ^= query)
>>> -- same spelling ^1.8n (query words in field)
>>> -- fuzzy matching ^n (aggressive tokenization and stemming)
>>> -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
>>> -- and three catch-all fields where every other field gets copied, with
>>> spelling, fuzzy and stub variants
>>> - where n is a factor based on the field's importance: page title and
>>> name have the highest boost, a catch-all field has the lowest boost
>>> - search with edismax, pf with double the boost (2n) on
>>> exact,prefix,spelling,fuzzy and qf on spelling,fuzzy,stub
>>>
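One way to read that boost scheme is the sketch below. It builds the edismax qf/pf parameters for a single logical field. The per-variant field names (title_exact, title_prefix, ...) and the reading of "double the boost (2n)" as twice each variant's factor times n are assumptions for illustration, not field names taken from an actual XWiki/Solr schema:

```python
# Sketch of the boost scheme described above. Field names and the pf
# interpretation are assumptions, not confirmed by the thread.

def edismax_params(field: str, n: float) -> dict:
    """Build edismax qf/pf clauses for one logical field with base boost n."""
    # qf matches individual query tokens: spelling ^1.8n, fuzzy ^n, stub ^.5n
    qf = [
        (f"{field}_spelling", 1.8 * n),
        (f"{field}_fuzzy",    1.0 * n),
        (f"{field}_stub",     0.5 * n),
    ]
    # pf rewards phrase matches with double the boost (2n) on the
    # exact ^5n, prefix ^1.5n, spelling ^1.8n, and fuzzy ^n variants
    pf = [
        (f"{field}_exact",    2 * 5.0 * n),
        (f"{field}_prefix",   2 * 1.5 * n),
        (f"{field}_spelling", 2 * 1.8 * n),
        (f"{field}_fuzzy",    2 * 1.0 * n),
    ]
    fmt = lambda clauses: " ".join(f"{f}^{b:g}" for f, b in clauses)
    return {"defType": "edismax", "qf": fmt(qf), "pf": fmt(pf)}

# A high-importance field like the page title gets a large n:
params = edismax_params("title", 4)
# params["qf"] == "title_spelling^7.2 title_fuzzy^4 title_stub^2"
# params["pf"] == "title_exact^40 title_prefix^12 title_spelling^14.4 title_fuzzy^8"
```

The catch-all fields would be additional qf/pf entries with the lowest n, populated via copyField in the Solr schema.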
> --
> Sergiu Dumitriu
> http://purl.org/net/sergiu/
>
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs