It seems that the existing write cache stores data in an unsorted manner (a hash table). I cannot come up with anything smarter than using a persistent sorted map for the write cache, as implemented in my project Rhinodog: persistent, so that readers can work without locking, and sorted, so that the docIDs for a particular termID can be accessed in order. My implementation indexes text about 2.5 times slower using the existing EnglishAnalyzer, so I'm wondering whether this is a good trade-off. It's probably desirable for some use cases, but not for all. Also, I'm new to Lucene, and I don't feel like throwing away code that has been here longer than I've been writing code. Perhaps real-time search is not that important?
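To make the idea concrete, here is a minimal sketch of a searchable write cache backed by a sorted map keyed by (termID, docID), so a single range scan yields one term's docIDs in order. All class and method names below are my own illustrations, not actual Rhinodog or Lucene APIs; I use Java's ConcurrentSkipListMap as a stand-in for a persistent sorted map, since it also allows readers to proceed without locks while writers insert (a truly persistent map would additionally give readers immutable snapshots).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch only: a write cache as a sorted concurrent map.
// Entries are ordered by termID first, then docID, so all postings for
// one term are a contiguous range that can be scanned in docID order.
public class WriteCacheSketch {

    // Composite key: (termID, docID), compared lexicographically.
    record TermKey(int termID, long docID) implements Comparable<TermKey> {
        public int compareTo(TermKey o) {
            int c = Integer.compare(termID, o.termID);
            return c != 0 ? c : Long.compare(docID, o.docID);
        }
    }

    // Per-posting payload: term frequency and length norm (extensible).
    record Measure(int freq, float norm) {}

    // Skip list: lock-free reads while writers insert concurrently.
    private final ConcurrentSkipListMap<TermKey, Measure> cache =
        new ConcurrentSkipListMap<>();

    void add(int termID, long docID, Measure m) {
        cache.put(new TermKey(termID, docID), m);
    }

    // All postings for a term, docIDs ascending: one subMap range scan.
    Map<TermKey, Measure> postings(int termID) {
        return cache.subMap(new TermKey(termID, Long.MIN_VALUE), true,
                            new TermKey(termID, Long.MAX_VALUE), true);
    }

    public static void main(String[] args) {
        WriteCacheSketch wc = new WriteCacheSketch();
        wc.add(7, 42L, new Measure(3, 0.5f));
        wc.add(7, 3L,  new Measure(1, 0.25f));
        wc.add(9, 5L,  new Measure(2, 0.5f));
        // The scan visits docID 3 before docID 42 for termID 7.
        wc.postings(7).forEach((k, v) ->
            System.out.println(k.docID() + " freq=" + v.freq()));
    }
}
```

The docID ordering is exactly what multi-term (conjunction) queries need for efficient postings intersection, which is why a hash-table write cache alone cannot serve them.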
2016-07-14 15:55 GMT+03:00 Michael McCandless <[email protected]>:

> Your RAMDirectory option is what NRTCachingDirectory does, I think? Small
> files are written in RAM, and only on merging them into larger files do we
> write those files to the real directory. It's not clear it's that helpful,
> though, because the OS does similar write caching, more efficiently.
>
> But even with RAMDirectory, you need to periodically open a new searcher
> ... which makes it *near* real time, not truly real time like the Twitter
> solution.
>
> Unfortunately, the crazy classes like BytesRefHash, TermsHash, etc., do
> not have any documentation beyond what comments you see in their sources
> ... maybe try looking at their test cases, or at how the classes are used
> by other classes in Lucene.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jul 14, 2016 at 8:14 AM, Konstantin <[email protected]> wrote:
>
>> Hello Michael,
>> Maybe this problem is already solved/(can be solved) on a different level
>> of abstraction (in Solr or Elasticsearch) - write new documents to both
>> the persistent index and a RAMDirectory, so new docs can be queried from
>> it immediately.
>> My motivation for this is to learn from Lucene. Could you please suggest
>> any source of information on BytesRefHash, TermsHash and the whole
>> indexing process?
>> Changing anything in there looks like a complex task to me too.
>>
>> 2016-07-14 11:54 GMT+03:00 Michael McCandless <[email protected]>:
>>
>>> Another example is Michael Busch's work while at Twitter, extending
>>> Lucene so you can do real-time searches of the write cache ... here's a
>>> paper describing it:
>>> http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
>>>
>>> But this was a very heavy modification of Lucene and wasn't ever
>>> contributed back.
>>>
>>> I do think it should be possible (just complex!) to have real-time
>>> searching of recently indexed documents, and the sorted terms are really
>>> only needed if you must support multi-term queries.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Tue, Jul 12, 2016 at 12:29 PM, Adrien Grand <[email protected]> wrote:
>>>
>>>> This is not something I am very familiar with, but this issue
>>>> https://issues.apache.org/jira/browse/LUCENE-2312 tried to improve NRT
>>>> latency by adding the ability to search directly into the indexing
>>>> buffer of the index writer.
>>>>
>>>> Le mar. 12 juil. 2016 à 16:11, Konstantin <[email protected]> a
>>>> écrit :
>>>>
>>>>> Hello everyone,
>>>>> As far as I understand, NRT requires flushing a new segment to disk.
>>>>> Is it correct that the write cache is not searchable?
>>>>>
>>>>> The competing search library groonga
>>>>> <http://groonga.org/docs/characteristic.html> claims much smaller
>>>>> real-time search latency (as far as I understand, via a searchable
>>>>> write cache), but loading data into their index takes almost three
>>>>> times longer (benchmark in a blog post in Japanese
>>>>> <http://blog.createfield.com/entry/2014/07/22/080958>; it seems to be
>>>>> a Wikipedia XML dump, though I'm not sure whether it's the English
>>>>> one).
>>>>>
>>>>> I've created an incomplete prototype of a searchable write cache in my
>>>>> pet project <https://github.com/kk00ss/Rhinodog> - and it takes two
>>>>> times longer to index a fraction of Wikipedia using the same
>>>>> EnglishAnalyzer from lucene.analysis (probably there is room for
>>>>> optimizations). While loading data into Lucene I didn't reuse Document
>>>>> instances. The searchable write cache was implemented as a bunch of
>>>>> persistent Scala SortedMap[TermKey, Measure] instances, one per
>>>>> logical core, where TermKey is defined as TermKey(termID: Int,
>>>>> docID: Long) and Measure is just frequency and norm (but could be
>>>>> extended).
>>>>>
>>>>> Do you think it's worth the slowdown? If so, I'm interested to learn
>>>>> how this part of Lucene works while implementing this feature.
>>>>> However, it is unclear to me how hard it would be to change the
>>>>> existing implementation. I cannot wrap my head around TermsHash and
>>>>> the whole flush process - is there any documentation, or are there
>>>>> good blog posts to read about it?
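The "one sorted map per logical core" layout described in the quoted message above can be sketched as follows: each indexing thread writes only into its own shard, so inserts never contend, and a query merges the per-shard sorted views at read time. This is a hedged illustration in Java rather than the actual Scala Rhinodog code; all names are hypothetical, and the shard-selection policy (here an explicit shard index) stands in for whatever per-thread assignment the real implementation uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch only: a write cache sharded per logical core.
public class ShardedWriteCache {
    // shard -> termID -> (docID -> term frequency); docIDs kept sorted.
    private final List<Map<Integer, TreeMap<Long, Integer>>> shards;

    public ShardedWriteCache(int numShards) {
        shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) shards.add(new HashMap<>());
    }

    // A writer thread picks its own shard (e.g. by core/thread id),
    // so no two writers ever touch the same map.
    public void add(int shard, int termID, long docID, int freq) {
        shards.get(shard)
              .computeIfAbsent(termID, t -> new TreeMap<>())
              .put(docID, freq);
    }

    // Query-time merge: union the per-shard postings. The TreeMap keeps
    // merged docIDs sorted, which multi-term queries rely on.
    public SortedMap<Long, Integer> postings(int termID) {
        TreeMap<Long, Integer> merged = new TreeMap<>();
        for (Map<Integer, TreeMap<Long, Integer>> shard : shards) {
            TreeMap<Long, Integer> p = shard.get(termID);
            if (p != null) merged.putAll(p);
        }
        return merged;
    }

    public static void main(String[] args) {
        ShardedWriteCache c = new ShardedWriteCache(2);
        c.add(0, 7, 10L, 2);   // written by a thread on shard 0
        c.add(1, 7, 3L, 1);    // written by a thread on shard 1
        System.out.println(c.postings(7));  // docIDs merged in order
    }
}
```

The trade-off this makes visible: writes stay cheap and contention-free, but every query pays an O(shards) merge, which is one plausible source of the indexing-versus-query balance discussed in this thread.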
