I think for most users, "near" real time is good enough, especially when you can control what "near" means for your use case. E.g., Elasticsearch defaults to opening a new searcher once per second.
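The refresh cadence described above can be sketched generically: writes accumulate in a buffer, and a background task republishes an immutable snapshot on a fixed interval, which bounds how "near" real time the searches are. This is a minimal stdlib-only sketch; the class and method names are illustrative, not Lucene's or Elasticsearch's API.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the near-real-time pattern: writes go into a live buffer, and a
// scheduled task republishes a read-only snapshot once per interval
// (Elasticsearch's default refresh interval is one second).
public class NrtRefresher {
    private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();
    private final AtomicReference<List<String>> snapshot =
        new AtomicReference<>(List.of());

    // Write path: documents land in the buffer immediately.
    public void index(String doc) { buffer.add(doc); }

    // Read path: searches see only the last published snapshot, never the
    // live buffer -- this is why results are "near" real time.
    public List<String> search() { return snapshot.get(); }

    // Publish a fresh immutable snapshot of everything indexed so far.
    public void refresh() { snapshot.set(List.copyOf(buffer)); }

    // Republish every intervalMs: the interval is the staleness bound.
    public ScheduledFuture<?> start(ScheduledExecutorService pool, long intervalMs) {
        return pool.scheduleAtFixedRate(this::refresh, intervalMs, intervalMs,
                                        TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        NrtRefresher n = new NrtRefresher();
        n.index("doc1");
        System.out.println(n.search().size()); // 0 -- not visible until refresh
        n.refresh();
        System.out.println(n.search().size()); // 1 -- visible after refresh
    }
}
```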
Mike McCandless

http://blog.mikemccandless.com

On Mon, Jul 18, 2016 at 7:27 AM, Konstantin <[email protected]> wrote:

> It seems that the existing write cache stores data in an unsorted manner
> (hash table). I cannot come up with anything smarter than using a
> persistent sorted map for the write cache, as implemented in my project
> Rhinodog. Persistent - to let readers work without locking; a sorted map -
> to access documentIDs for a particular termID in order.
> My implementation indexes text about 2.5 times slower using the existing
> EnglishAnalyzer, so I'm wondering if this is a good trade-off. Probably
> for some use cases it's desirable, but not for all.
> Also, I'm new to Lucene, and don't feel like throwing away code that has
> been here longer than I've been writing code. Probably real-time search
> is not very important?
>
> 2016-07-14 15:55 GMT+03:00 Michael McCandless <[email protected]>:
>
>> Your RAMDirectory option is what NRTCachingDirectory does, I think? Small
>> files are written in RAM, and only on merging them into larger files do we
>> write those files to the real directory. It's not clear it's that helpful,
>> though, because the OS does similar write caching, more efficiently.
>>
>> But even with RAMDirectory, you need to periodically open a new searcher
>> ... which makes it *near* real time, not truly real time like the Twitter
>> solution.
>>
>> Unfortunately, the crazy classes like BytesRefHash, TermsHash, etc., do
>> not have any documentation beyond what comments you see in their sources
>> ... maybe try looking at their test cases, or at how the classes are used
>> by other classes in Lucene.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Jul 14, 2016 at 8:14 AM, Konstantin <[email protected]> wrote:
>>
>>> Hello Michael,
>>> Maybe this problem is already solved (or can be solved) on a different
>>> level of abstraction (in Solr or Elasticsearch): write new documents to
>>> both the persistent index and a RAMDirectory, so new docs can be queried
>>> from it immediately.
>>> My motivation for this is to learn from Lucene. Could you please suggest
>>> any source of information on BytesRefHash, TermsHash, and the whole
>>> indexing process? Changing anything in there looks like a complex task
>>> to me too.
>>>
>>> 2016-07-14 11:54 GMT+03:00 Michael McCandless <[email protected]>:
>>>
>>>> Another example is Michael Busch's work while at Twitter, extending
>>>> Lucene so you can do real-time searches of the write cache ... here's a
>>>> paper describing it:
>>>> http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
>>>>
>>>> But this was a very heavy modification of Lucene and wasn't ever
>>>> contributed back.
>>>>
>>>> I do think it should be possible (just complex!) to have real-time
>>>> searching of recently indexed documents; the sorted terms are really
>>>> only needed if you must support multi-term queries.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Tue, Jul 12, 2016 at 12:29 PM, Adrien Grand <[email protected]> wrote:
>>>>
>>>>> This is not something I am very familiar with, but this issue
>>>>> https://issues.apache.org/jira/browse/LUCENE-2312 tried to improve
>>>>> NRT latency by adding the ability to search directly into the indexing
>>>>> buffer of the index writer.
>>>>>
>>>>> On Tue, Jul 12, 2016 at 4:11 PM, Konstantin <[email protected]> wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>> As far as I understand, NRT requires flushing a new segment to disk.
>>>>>> Is it correct that the write cache is not searchable?
>>>>>>
>>>>>> A competing search library, Groonga
>>>>>> <http://groonga.org/docs/characteristic.html>, claims much smaller
>>>>>> real-time search latency (as far as I understand, via a searchable
>>>>>> write cache), but loading data into their index takes almost three
>>>>>> times longer (benchmark in a blog post in Japanese
>>>>>> <http://blog.createfield.com/entry/2014/07/22/080958>; it seems to be
>>>>>> a Wikipedia XML dump, I'm not sure if it's the English one).
>>>>>>
>>>>>> I've created an incomplete prototype of a searchable write cache in
>>>>>> my pet project <https://github.com/kk00ss/Rhinodog> - and it takes
>>>>>> two times longer to index a fraction of Wikipedia using the same
>>>>>> EnglishAnalyzer from lucene.analysis (probably there is room for
>>>>>> optimizations). While loading data into Lucene I didn't reuse
>>>>>> Document instances. The searchable write cache was implemented as a
>>>>>> bunch of persistent Scala SortedMap[TermKey, Measure] instances, one
>>>>>> per logical core, where TermKey is defined as TermKey(termID: Int,
>>>>>> docID: Long) and Measure is just frequency and norm (but could be
>>>>>> extended).
>>>>>>
>>>>>> Do you think it's worth the slowdown? If so, I'm interested to learn
>>>>>> how this part of Lucene works while implementing this feature.
>>>>>> However, it is unclear to me how hard it would be to change the
>>>>>> existing implementation. I cannot wrap my head around TermsHash and
>>>>>> the whole flush process - are there any documentation or good blog
>>>>>> posts to read about it?
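The sorted write cache described in the thread - postings keyed by (termID, docID) so readers can scan a term's docIDs in order without locking - can be sketched with the JDK's lock-free sorted map. `TermKey` and `Measure` follow the names in the thread; using `ConcurrentSkipListMap` in place of the persistent Scala map is an assumption of this sketch, not Rhinodog's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of a searchable write cache: postings live in a sorted map keyed by
// (termID, docID), so a reader can range-scan all docIDs for one term in
// increasing order while writers keep inserting concurrently.
public class WriteCacheSketch {
    // Composite key: order by termID first, then docID, so each term's
    // postings are contiguous in the map.
    record TermKey(int termID, long docID) implements Comparable<TermKey> {
        public int compareTo(TermKey o) {
            int c = Integer.compare(termID, o.termID);
            return c != 0 ? c : Long.compare(docID, o.docID);
        }
    }

    // Per-posting payload: term frequency and norm, as in the thread.
    record Measure(int freq, float norm) {}

    private final ConcurrentSkipListMap<TermKey, Measure> cache =
        new ConcurrentSkipListMap<>();

    // Writer side: record one posting.
    public void add(int termID, long docID, int freq, float norm) {
        cache.put(new TermKey(termID, docID), new Measure(freq, norm));
    }

    // Reader side: all docIDs for a term, in increasing order, via a
    // range view over [ (termID, 0) .. (termID, Long.MAX_VALUE) ].
    public List<Long> postingsFor(int termID) {
        List<Long> docs = new ArrayList<>();
        cache.subMap(new TermKey(termID, 0L), true,
                     new TermKey(termID, Long.MAX_VALUE), true)
             .keySet().forEach(k -> docs.add(k.docID()));
        return docs;
    }

    public static void main(String[] args) {
        WriteCacheSketch c = new WriteCacheSketch();
        c.add(7, 5L, 2, 1.0f);
        c.add(7, 1L, 1, 1.0f);
        c.add(3, 9L, 4, 0.5f);
        System.out.println(c.postingsFor(7)); // [1, 5] -- sorted docIDs for term 7
    }
}
```

Sharding one such map per logical core, as the thread describes, keeps writer contention low; a query then merges the per-core range scans.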
