Dawid Weiss commented on LUCENE-8438:

Sure. I tried to explain why I wrote ByteBuffersIndexInput separately 
(structural conditions instead of exception handlers, etc.). It's just more 
appealing to me, but I also kept wondering what's going to happen if you have 
smaller page sizes (mmap will typically default to a single buffer, even on 
fairly sizeable files; this is no longer the case with smaller heap-chunked 
buffers).
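To make the "structural conditions instead of exception handlers" point concrete: instead of reading optimistically and catching BufferUnderflowException when a read crosses a block boundary, the block index and offset can be computed up front. The sketch below is a hypothetical, much-simplified multi-buffer reader (the class name, block layout, and sizes are invented for illustration and are not the actual ByteBuffersIndexInput code):

```java
import java.nio.ByteBuffer;

// Simplified sketch: a "file" backed by fixed-size ByteBuffer blocks,
// read via explicit bounds arithmetic rather than exception handling.
public class MultiBufferRead {
    private final ByteBuffer[] blocks; // fixed-size blocks backing the data
    private final int blockBits;       // log2 of the block size
    private long pos;                  // absolute read position

    MultiBufferRead(ByteBuffer[] blocks, int blockBits) {
        this.blocks = blocks;
        this.blockBits = blockBits;
    }

    // Structural variant: compute which block the position falls in,
    // and the offset within it, before touching any buffer.
    byte readByte() {
        int block = (int) (pos >>> blockBits);
        int offset = (int) (pos & ((1L << blockBits) - 1));
        pos++;
        return blocks[block].get(offset); // absolute get, no state to restore
    }

    public static void main(String[] args) {
        // Two 4-byte blocks (blockBits = 2) holding the bytes 0..7.
        ByteBuffer b0 = ByteBuffer.wrap(new byte[] {0, 1, 2, 3});
        ByteBuffer b1 = ByteBuffer.wrap(new byte[] {4, 5, 6, 7});
        MultiBufferRead in = new MultiBufferRead(new ByteBuffer[] {b0, b1}, 2);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            if (i > 0) out.append(',');
            out.append(in.readByte());
        }
        System.out.println(out); // prints 0,1,2,3,4,5,6,7
    }
}
```

The reads crossing from block 0 into block 1 need no special casing; the same arithmetic picks the right buffer every time, which is the structural style described above.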

bq. We should also make everything package private, which is internal 

There are numerous classes around the codebase that use those in-memory buffer 
classes, so we can't make them package-private (we could make them 
module-private if we target the module system at some point).

Also, making everything package-private gives a cleaner API, but it also forces 
virtually everyone who'd like to experiment with different directory/buffer 
wrapper implementations to reinvent the wheel here (read: copy-paste). Look at 
LUCENE-8406 -- one can't even reuse that buffers-based IndexInput 
implementation because it's package-private. And it's *really* complex stuff 
that took serious effort to write and test. My personal opinion is that if we 
provide a public API for low-level index tinkering (IndexInput) then it would 
be a nice thing to also make some crucial implementation bits that implement 
those interfaces available. As a programmer it makes me feel bad to copy/paste 
those bits over to my codebase just because of a package-private scope on the 
constructor (sure, we still do it a lot).

There are pros and cons to both choices I guess, but I think it's worth leaving 
reusable (or customizable) implementation classes of public API interfaces a 
bit more open.

> RAMDirectory speed improvements and cleanup
> -------------------------------------------
>                 Key: LUCENE-8438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8438
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>         Attachments: capture-1.png, capture-4.png
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
> RAMDirectory screams for a cleanup. It is used and abused in many places and 
> even if we discourage its use in favor of native (mmapped) buffers, there 
> seem to be benefits of keeping RAMDirectory available (quick throw-away 
> indexes without the need to setup external tmpfs, for example).
> Currently RAMDirectory performs very poorly under concurrent loads. The 
> implementation is also open for all sorts of abuses – the streams can be 
> reset and are used all around the place as temporary buffers, even without 
> the presence of RAMDirectory itself. This complicates the implementation and 
> is pretty confusing.
> An example of how dramatically slow RAMDirectory is under concurrent load, 
> consider this PoC pseudo-benchmark. It creates a single monolithic segment 
> with 500K very short documents (single field, with norms). The index is ~60MB 
> once created. We then run semi-complex Boolean queries on top of that index 
> from N concurrent threads. The attached capture-4 shows the result (queries 
> per second over 5-second spans) for a varying number of concurrent threads on 
> an AWS machine with 32 CPUs available (of which 16 seem to be real and 16 
> hyper-threaded). That red line at the bottom (which drops compared to 
> single-threaded performance) is the current RAMDirectory. RAMDirectory2 is an 
> alternative implementation I wrote that uses ByteBuffers. Yes, it's slower 
> than the native mmapped implementation, but a *lot* faster than the current 
> RAMDirectory (and more GC-friendly because it uses dynamic progressive block 
> scaling internally).
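The "dynamic progressive block scaling" mentioned in the quoted description can be sketched as follows. This is a hypothetical illustration only (the class name and the 1 KB / 32 MB bounds are assumptions, not the actual RAMDirectory2 values): start with a small block and double the block size up to a cap, so tiny throw-away files allocate almost nothing while large files don't fragment into thousands of small arrays.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of progressive block scaling: each new block is
// twice the size of the previous one, up to a fixed cap.
public class ProgressiveBlocks {
    static final int MIN_BLOCK = 1 << 10;  // 1 KB first block (assumed)
    static final int MAX_BLOCK = 1 << 25;  // 32 MB cap (assumed)

    // Return the sizes of the blocks needed to hold totalBytes.
    static List<Integer> blockSizes(long totalBytes) {
        List<Integer> sizes = new ArrayList<>();
        int block = MIN_BLOCK;
        long remaining = totalBytes;
        while (remaining > 0) {
            sizes.add(block);
            remaining -= block;
            if (block < MAX_BLOCK) block <<= 1; // double until the cap
        }
        return sizes;
    }

    public static void main(String[] args) {
        // A 3 KB file needs one 1 KB block and one 2 KB block.
        System.out.println(blockSizes(3 * 1024)); // prints [1024, 2048]
    }
}
```

The GC-friendliness claimed above follows from this shape: block count grows logarithmically with file size instead of linearly, so fewer objects are allocated and fewer large contiguous arrays are resized.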

This message was sent by Atlassian JIRA
