I think for most users, "near" real time is good enough, especially when you can control what "near" means for your use case. E.g., Elasticsearch defaults to opening a new searcher once per second.
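The refresh cadence described above can be sketched generically: writes accumulate in a buffer, and a background task republishes an immutable snapshot on a fixed interval, which bounds how "near" real time the searches are. This is a minimal stdlib-only sketch; the class and method names are illustrative, not Lucene's or Elasticsearch's API.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the near-real-time pattern: writes go into a live buffer, and a
// scheduled task republishes a read-only snapshot once per interval
// (Elasticsearch's default refresh interval is one second).
public class NrtRefresher {
    private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();
    private final AtomicReference<List<String>> snapshot =
        new AtomicReference<>(List.of());

    // Write path: documents land in the buffer immediately.
    public void index(String doc) { buffer.add(doc); }

    // Read path: searches see only the last published snapshot, never the
    // live buffer -- this is why results are "near" real time.
    public List<String> search() { return snapshot.get(); }

    // Publish a fresh immutable snapshot of everything indexed so far.
    public void refresh() { snapshot.set(List.copyOf(buffer)); }

    // Republish every intervalMs: the interval is the staleness bound.
    public ScheduledFuture<?> start(ScheduledExecutorService pool, long intervalMs) {
        return pool.scheduleAtFixedRate(this::refresh, intervalMs, intervalMs,
                                        TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        NrtRefresher n = new NrtRefresher();
        n.index("doc1");
        System.out.println(n.search().size()); // 0 -- not visible until refresh
        n.refresh();
        System.out.println(n.search().size()); // 1 -- visible after refresh
    }
}
```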
Mike McCandless

http://blog.mikemccandless.com

On Mon, Jul 18, 2016 at 7:27 AM, Konstantin <[email protected]> wrote:

> It seems that the existing write cache stores data in an unsorted manner
> (hash table). I cannot come up with anything smarter than using a
> persistent sorted map for the write cache, as implemented in my project
> Rhinodog. Persistent - to let readers work without locking; a sorted map -
> to access documentIDs for a particular termID in order.
> My implementation indexes text about 2.5 times slower using the existing
> EnglishAnalyzer, so I'm wondering if this is a good trade-off. Probably
> for some use cases it's desirable, but not for all.
> Also, I'm new to Lucene, and don't feel like throwing away code that has
> been here longer than I've been writing code. Probably real-time search
> is not very important?
>
> 2016-07-14 15:55 GMT+03:00 Michael McCandless <[email protected]>:
>
>> Your RAMDirectory option is what NRTCachingDirectory does, I think? Small
>> files are written in RAM, and only on merging them into larger files do we
>> write those files to the real directory. It's not clear it's that helpful,
>> though, because the OS does similar write caching, more efficiently.
>>
>> But even with RAMDirectory, you need to periodically open a new searcher
>> ... which makes it *near* real time, not truly real time like the Twitter
>> solution.
>>
>> Unfortunately, the crazy classes like BytesRefHash, TermsHash, etc., do
>> not have any documentation beyond what comments you see in their sources
>> ... maybe try looking at their test cases, or at how the classes are used
>> by other classes in Lucene.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Jul 14, 2016 at 8:14 AM, Konstantin <[email protected]> wrote:
>>
>>> Hello Michael,
>>> Maybe this problem is already solved (or can be solved) on a different
>>> level of abstraction (in Solr or Elasticsearch): write new documents to
>>> both the persistent index and a RAMDirectory, so new docs can be queried
>>> from it immediately.
>>> My motivation for this is to learn from Lucene. Could you please suggest
>>> any source of information on BytesRefHash, TermsHash, and the whole
>>> indexing process? Changing anything in there looks like a complex task
>>> to me too.
>>>
>>> 2016-07-14 11:54 GMT+03:00 Michael McCandless <[email protected]>:
>>>
>>>> Another example is Michael Busch's work while at Twitter, extending
>>>> Lucene so you can do real-time searches of the write cache ... here's a
>>>> paper describing it:
>>>> http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
>>>>
>>>> But this was a very heavy modification of Lucene and wasn't ever
>>>> contributed back.
>>>>
>>>> I do think it should be possible (just complex!) to have real-time
>>>> searching of recently indexed documents; the sorted terms are really
>>>> only needed if you must support multi-term queries.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Tue, Jul 12, 2016 at 12:29 PM, Adrien Grand <[email protected]> wrote:
>>>>
>>>>> This is not something I am very familiar with, but this issue
>>>>> https://issues.apache.org/jira/browse/LUCENE-2312 tried to improve
>>>>> NRT latency by adding the ability to search directly into the indexing
>>>>> buffer of the index writer.
>>>>>
>>>>> On Tue, Jul 12, 2016 at 4:11 PM, Konstantin <[email protected]> wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>> As far as I understand, NRT requires flushing a new segment to disk.
>>>>>> Is it correct that the write cache is not searchable?
>>>>>>
>>>>>> A competing search library, Groonga
>>>>>> <http://groonga.org/docs/characteristic.html>, claims much smaller
>>>>>> real-time search latency (as far as I understand, via a searchable
>>>>>> write cache), but loading data into their index takes almost three
>>>>>> times longer (benchmark in a blog post in Japanese
>>>>>> <http://blog.createfield.com/entry/2014/07/22/080958>; it seems to be
>>>>>> a Wikipedia XML dump, I'm not sure if it's the English one).
>>>>>>
>>>>>> I've created an incomplete prototype of a searchable write cache in
>>>>>> my pet project <https://github.com/kk00ss/Rhinodog> - and it takes
>>>>>> two times longer to index a fraction of Wikipedia using the same
>>>>>> EnglishAnalyzer from lucene.analysis (probably there is room for
>>>>>> optimizations). While loading data into Lucene I didn't reuse
>>>>>> Document instances. The searchable write cache was implemented as a
>>>>>> bunch of persistent Scala SortedMap[TermKey, Measure] instances, one
>>>>>> per logical core, where TermKey is defined as TermKey(termID: Int,
>>>>>> docID: Long) and Measure is just frequency and norm (but could be
>>>>>> extended).
>>>>>>
>>>>>> Do you think it's worth the slowdown? If so, I'm interested to learn
>>>>>> how this part of Lucene works while implementing this feature.
>>>>>> However, it is unclear to me how hard it would be to change the
>>>>>> existing implementation. I cannot wrap my head around TermsHash and
>>>>>> the whole flush process - are there any documentation or good blog
>>>>>> posts to read about it?
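The sorted write cache described in the thread - postings keyed by (termID, docID) so readers can scan a term's docIDs in order without locking - can be sketched with the JDK's lock-free sorted map. `TermKey` and `Measure` follow the names in the thread; using `ConcurrentSkipListMap` in place of the persistent Scala map is an assumption of this sketch, not Rhinodog's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of a searchable write cache: postings live in a sorted map keyed by
// (termID, docID), so a reader can range-scan all docIDs for one term in
// increasing order while writers keep inserting concurrently.
public class WriteCacheSketch {
    // Composite key: order by termID first, then docID, so each term's
    // postings are contiguous in the map.
    record TermKey(int termID, long docID) implements Comparable<TermKey> {
        public int compareTo(TermKey o) {
            int c = Integer.compare(termID, o.termID);
            return c != 0 ? c : Long.compare(docID, o.docID);
        }
    }

    // Per-posting payload: term frequency and norm, as in the thread.
    record Measure(int freq, float norm) {}

    private final ConcurrentSkipListMap<TermKey, Measure> cache =
        new ConcurrentSkipListMap<>();

    // Writer side: record one posting.
    public void add(int termID, long docID, int freq, float norm) {
        cache.put(new TermKey(termID, docID), new Measure(freq, norm));
    }

    // Reader side: all docIDs for a term, in increasing order, via a
    // range view over [ (termID, 0) .. (termID, Long.MAX_VALUE) ].
    public List<Long> postingsFor(int termID) {
        List<Long> docs = new ArrayList<>();
        cache.subMap(new TermKey(termID, 0L), true,
                     new TermKey(termID, Long.MAX_VALUE), true)
             .keySet().forEach(k -> docs.add(k.docID()));
        return docs;
    }

    public static void main(String[] args) {
        WriteCacheSketch c = new WriteCacheSketch();
        c.add(7, 5L, 2, 1.0f);
        c.add(7, 1L, 1, 1.0f);
        c.add(3, 9L, 4, 0.5f);
        System.out.println(c.postingsFor(7)); // [1, 5] -- sorted docIDs for term 7
    }
}
```

Sharding one such map per logical core, as the thread describes, keeps writer contention low; a query then merges the per-core range scans.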
