Another example is Michael Busch's work while at Twitter, extending Lucene
so you can do real-time searches of the write cache ... here's a paper
describing it:
http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf

But this was a very heavy modification of Lucene and wasn't ever
contributed back.

I do think it should be possible (just complex!) to have real-time
searching of recently indexed documents, and sorted terms are really
only needed if you must support multi-term queries.
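To illustrate the point with a toy sketch (plain Java collections, not Lucene's actual postings format): a hash keyed by term is enough for exact single-term lookups, but enumerating terms, e.g. for a prefix query, needs them sorted:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class WriteCacheLookup {
    public static void main(String[] args) {
        // Unsorted term -> postings map: enough for exact-term queries.
        Map<String, List<Integer>> hashPostings = new HashMap<>();
        hashPostings.computeIfAbsent("lucene", k -> new ArrayList<>()).add(1);
        hashPostings.computeIfAbsent("latency", k -> new ArrayList<>()).add(2);
        hashPostings.computeIfAbsent("lucid", k -> new ArrayList<>()).add(3);
        System.out.println(hashPostings.get("lucene")); // exact lookup: [1]

        // Sorted terms: needed once you must enumerate terms,
        // e.g. for a prefix ("luc*") query.
        NavigableMap<String, List<Integer>> sortedPostings =
            new TreeMap<>(hashPostings);
        SortedMap<String, List<Integer>> matches =
            sortedPostings.subMap("luc", "luc\uffff");
        System.out.println(matches.keySet()); // [lucene, lucid]
    }
}
```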

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jul 12, 2016 at 12:29 PM, Adrien Grand <[email protected]> wrote:

> This is not something I am very familiar with, but this issue
> https://issues.apache.org/jira/browse/LUCENE-2312 tried to improve NRT
> latency by adding the ability to search directly into the indexing buffer
> of the index writer.
>
> Le mar. 12 juil. 2016 à 16:11, Konstantin <[email protected]> a
> écrit :
>
>> Hello everyone,
>> As far as I understand, NRT requires flushing a new segment to disk. Is it
>> correct that the write cache is not searchable?
>>
>> The competing search library Groonga
>> <http://groonga.org/docs/characteristic.html> claims much lower
>> real-time search latency (as far as I understand, via a searchable
>> write cache), but loading data into its index takes almost three times
>> longer (benchmark in a blog post in Japanese
>> <http://blog.createfield.com/entry/2014/07/22/080958>; it seems to be a
>> Wikipedia XML dump, though I'm not sure if it's the English one).
>>
>> I've created an incomplete prototype of a searchable write cache in my pet
>> project <https://github.com/kk00ss/Rhinodog> - it takes twice as long to
>> index a fraction of Wikipedia using the same EnglishAnalyzer from
>> lucene.analysis (there is probably room for optimization). While loading
>> data into Lucene I didn't reuse Document instances. The searchable
>> write cache was implemented as a set of persistent Scala
>> SortedMap[TermKey, Measure] instances, one per logical core, where
>> TermKey is defined as TermKey(termID: Int, docID: Long) and Measure is
>> just a frequency and a norm (but could be extended).
>>
>> Do you think it's worth the slowdown? If so, I'm interested in learning
>> how this part of Lucene works while implementing this feature. However,
>> it is unclear to me how hard it would be to change the existing
>> implementation. I cannot wrap my head around TermsHash and the whole
>> flush process - is there any documentation, or are there good blog posts
>> to read about it?
>>
>>
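For what it's worth, the cache structure described above might be sketched in plain Java like this (TermKey and Measure are the names from the mail; the per-core sharding, persistence, and the actual Scala code are elided - this is only an illustration of the (termID, docID)-ordered map idea, not the real implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class SearchableWriteCache {
    // Key ordered by (termID, docID) so one term's postings are contiguous.
    record TermKey(int termID, long docID) implements Comparable<TermKey> {
        public int compareTo(TermKey o) {
            int c = Integer.compare(termID, o.termID);
            return c != 0 ? c : Long.compare(docID, o.docID);
        }
    }

    // Per-posting payload: frequency and norm, as in the mail.
    record Measure(int freq, float norm) {}

    public static void main(String[] args) {
        ConcurrentSkipListMap<TermKey, Measure> cache = new ConcurrentSkipListMap<>();
        cache.put(new TermKey(7, 100L), new Measure(2, 0.5f));
        cache.put(new TermKey(7, 101L), new Measure(1, 1.0f));
        cache.put(new TermKey(9, 100L), new Measure(3, 0.25f));

        // Postings for termID 7: a range scan over [(7, MIN_DOC), (7, MAX_DOC)].
        Map<TermKey, Measure> postings = cache.subMap(
            new TermKey(7, Long.MIN_VALUE), true,
            new TermKey(7, Long.MAX_VALUE), true);
        postings.forEach((k, v) ->
            System.out.println("doc=" + k.docID() + " freq=" + v.freq()));
    }
}
```

A range scan over the sorted keys recovers a term's postings in docID order, which is what a query needs to iterate; that is the appeal of a sorted map here over a plain hash.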
