Hello everyone,
As far as I understand, NRT requires flushing a new segment to disk. Is it
correct that the write cache (the in-memory indexing buffer) is not searchable?

A competing search library, Groonga
<http://groonga.org/docs/characteristic.html>, claims much smaller
realtime search latency (as far as I understand, via a searchable
write cache), but loading data into their index takes almost three times
longer (benchmark in a blog post in Japanese
<http://blog.createfield.com/entry/2014/07/22/080958>; it appears to be a
Wikipedia XML dump, though I'm not sure whether it's the English one).

I've created an incomplete prototype of a searchable write cache in my pet
project <https://github.com/kk00ss/Rhinodog>, and it takes two times
longer to index a fraction of Wikipedia using the same EnglishAnalyzer from
lucene.analysis (there is probably room for optimizations). While loading
data into Lucene I didn't reuse Document instances. The searchable write
cache was implemented as a bunch of persistent Scala SortedMap[TermKey,
Measure] instances, one per logical core, where TermKey is defined as
TermKey(termID: Int, docID: Long) and Measure is just a frequency and a
norm (but could be extended).
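For concreteness, here is a rough sketch of that per-core sorted-map layout. This is illustrative only, not my actual Rhinodog code: it is in Java rather than Scala, the names TermKey and Measure are just carried over from above, and a concurrent skip list stands in for the persistent (immutable) maps I actually used.

```java
import java.util.*;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of a searchable write cache: one sorted map per logical core,
// keyed by (termID, docID) so each term's postings are contiguous.
class WriteCacheSketch {
    record TermKey(int termID, long docID) implements Comparable<TermKey> {
        public int compareTo(TermKey o) {
            int c = Integer.compare(termID, o.termID);
            return c != 0 ? c : Long.compare(docID, o.docID);
        }
    }
    // Per-posting payload: term frequency and a norm (could be extended).
    record Measure(int frequency, float norm) {}

    // One map per core; the skip list allows concurrent inserts while
    // readers traverse a live view (the Scala version used persistent
    // immutable maps and snapshots instead).
    final ConcurrentSkipListMap<TermKey, Measure>[] shards;

    @SuppressWarnings("unchecked")
    WriteCacheSketch(int cores) {
        shards = new ConcurrentSkipListMap[cores];
        for (int i = 0; i < cores; i++) shards[i] = new ConcurrentSkipListMap<>();
    }

    void add(int shard, TermKey key, Measure m) {
        shards[shard].put(key, m);
    }

    // Searching the write cache: a range scan over
    // [(termID, Long.MIN_VALUE), (termID, Long.MAX_VALUE)] in every shard.
    List<Map.Entry<TermKey, Measure>> postings(int termID) {
        List<Map.Entry<TermKey, Measure>> out = new ArrayList<>();
        for (var shard : shards)
            out.addAll(shard.subMap(new TermKey(termID, Long.MIN_VALUE), true,
                                    new TermKey(termID, Long.MAX_VALUE), true)
                            .entrySet());
        return out;
    }
}
```

The point of the (termID, docID) composite key is that newly indexed documents become visible to term queries immediately on insert, without any flush.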

Do you think it's worth the slowdown? If so, I'm interested in learning how
this part of Lucene works while implementing this feature. However, it is
unclear to me how hard it would be to change the existing implementation. I
cannot wrap my head around TermsHash and the whole flush process; is there
any documentation, or are there good blog posts, to read about it?
