On 05/08/2015 05:22 AM, Marius Dumitru Florea wrote:
> On Fri, May 8, 2015 at 10:39 AM, Sergiu Dumitriu <ser...@xwiki.org> wrote:
>> Well, my use case is not the same, since I'm indexing ontologies and the
>> end purpose is to find the best matching terms. A few numbers though:
>>
>> - 4MB ontology with 11k terms ends up as a 16MB index (including
>> spellcheck, and most fields are also stored); searches take ~40ms
>> including the XWiki overhead, ~10ms just in Solr
>> - 180MB ontology with 24k terms -> 100MB index, ~15ms Solr search time
>>
>> For smaller indexes, it does seem to use more disk space than the
>> source, but Lucene is good at indexing larger data sets, and after a
>> while the index grows more slowly than the data.
>>
> 
>> For me it is worth the extra disk space, since every user is amazed by
>> how good the search is at finding the relevant terms, overcoming typos,
>> synonyms, and abbreviations, plus autocomplete while typing.
> 
> You do this for multiple languages or just for English? In other
> words, do you have text_fr_splitting, text_es_splitting etc.?

At the moment only English.

> Thanks Sergiu, I'll definitely take this into account.
> Marius
> 
>>
>> In XWiki, not every field should be indexed in every one of these ways,
>> since it doesn't make sense to expect an exact match on a large textarea
>> or the document content.
>>
>> On 05/07/2015 09:57 AM, Marius Dumitru Florea wrote:
>>> Hi Sergiu,
>>>
>>> Can you tell us the effect on the index size (and speed in the end) if
>>> each field (e.g. document title, a String or TextArea property) is
>>> indexed in 5 different ways (5 separate fields in the index)? Is it
>>> worth having this configuration by default?
>>>
>>> Thanks,
>>> Marius
>>>
>>> On Tue, May 5, 2015 at 4:57 PM, Sergiu Dumitriu <ser...@xwiki.org> wrote:
>>>> I agree with Paul.
>>>>
>>>> The way I usually do searches is:
>>>>
>>>> - each field gets indexed several times, including:
>>>> -- exact matches ^5n (field == query)
>>>> -- prefix matches ^1.5n (field ^= query)
>>>> -- same spelling ^1.8n (query words in field)
>>>> -- fuzzy matching ^n (aggressive tokenization and stemming)
>>>> -- stub matching ^.5n (query tokens are prefixes of indexed tokens)
>>>> -- and three catch-all fields where every other field gets copied, with
>>>> spelling, fuzzy and stub variants
>>>> - where n is a factor based on the field's importance: page title and
>>>> name have the highest boost, a catch-all field has the lowest boost
>>>> - search with edismax: pf with double the boost (2n) on
>>>> exact, prefix, spelling, fuzzy, and qf on spelling, fuzzy, stub
>>>>
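
To make the boosting scheme quoted above more concrete, here is a minimal sketch of how the qf and pf parameters could be assembled. It is not the actual configuration: the field naming (`<field>_<variant>`), the helper names, and the example importance factors are assumptions for illustration only. It also assumes that "double the boost (2n)" means every pf boost is computed from 2n instead of n. The multipliers and the qf/pf variant lists come from the list above; `defType`, `qf` and `pf` are standard edismax request parameters.

```python
# Multiplier applied to the field importance factor n, per index variant
# (values taken from the list in the quoted message).
VARIANT_BOOSTS = {
    "exact": 5.0,
    "prefix": 1.5,
    "spelling": 1.8,
    "fuzzy": 1.0,
    "stub": 0.5,
}

# Which variants take part in phrase boosting (pf) and in matching (qf).
PF_VARIANTS = ("exact", "prefix", "spelling", "fuzzy")
QF_VARIANTS = ("spelling", "fuzzy", "stub")


def boosted_fields(importance, variants, scale=1.0):
    """Build a Solr 'field^boost field^boost ...' string.

    importance maps a base field name to its factor n. The index field
    naming '<field>_<variant>' is a hypothetical convention.
    """
    parts = []
    for field, n in importance.items():
        for variant in variants:
            boost = VARIANT_BOOSTS[variant] * n * scale
            parts.append(f"{field}_{variant}^{boost:g}")
    return " ".join(parts)


def edismax_params(query, importance):
    """Return request parameters for an edismax query."""
    return {
        "defType": "edismax",
        "q": query,
        "qf": boosted_fields(importance, QF_VARIANTS),
        # Assumption: pf boosts are computed from 2n instead of n.
        "pf": boosted_fields(importance, PF_VARIANTS, scale=2.0),
    }


if __name__ == "__main__":
    # Example factors only: title weighted above a catch-all field.
    params = edismax_params("solr search", {"title": 4.0, "all": 1.0})
    for key, value in params.items():
        print(f"{key}={value}")
```

The resulting dictionary could be passed as query parameters to a Solr select handler. Keeping the multipliers in one table makes it easy to drop a variant for fields where it does not make sense, such as exact matching on large text areas.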

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu
_______________________________________________
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs
