Hi Team,

Today, Lucene's Directory abstraction does not allow opening an IndexInput
on a file until the file is fully written and closed via IndexOutput.  We
enforce this in tests, and some of our core Directory implementations
demand this (e.g. caching the file's length on opening an IndexInput).

Yet, most filesystems will easily allow simultaneous read/append of a
single file.  We just don't expose this IO semantics to Lucene, but could
we allow random-access reads with append-only writes on one file?  Is there
a strong reason that we don't allow this?

Quick TL/DR context: we are trying to enable FST compilation to write
off-heap (directly to disk), enabling creating arbitrarily large FSTs with
bounded heap, matching how FSTs can now be read off-heap, and it would be
much much more RAM efficient if we could read/append the same file at once.

Full gory details context: inspired by how Tantivy
<https://github.com/quickwit-oss/tantivy> (awesome and fast Rust search
engine!) writes its FSTs <https://blog.burntsushi.net/transducers/>, over
in this issue <https://github.com/apache/lucene/issues/12543> and PR
<https://github.com/dungba88/lucene/commit/882f5a5b1f60d4321d2e09986335063368c08e9b>,
we (thank you Dzung Bui / @dungba88!) are trying to fix Lucene's FST
building to immediately stream the FST to disk, instead of buffering the
whole thing in RAM and then writing to disk.

This would allow building arbitrarily large FSTs without using up heap, and
symmetrically matches how we can now read FSTs off-heap, plus FST building
is already (mostly) append-only. This would also allow removing some of the
crazy abstractions we have for writing FST bytes into RAM (FSTStore,
BytesStore).  It would enable interesting things like a Codec whose term
dictionary is stored entirely in an FST
<https://github.com/apache/lucene/pull/12688> (also inspired by Tantivy).

The wrinkle is that, while the FST is building, it sometimes looks back and
reads previously written bytes, to share suffixes and create a minimal (or
near minimal) FST.  So if IndexInput could read those bytes, even as the
FST is still appending to IndexOutput, it would "just work".

Failing that, our plan B is to wastefully duplicate the byte[] slices from
the already written bytes into our own private (heap resident, boo) copy,
which would use quite a bit more RAM while building the FST, and make less
minimal FSTs for a given RAM budget.  I haven't measured the added wasted
RAM if we have to go this route but I fear it is sizable in practice, i.e.
it strongly negates the whole idea of writing an FST off-heap since its
effectively storing a possibly large portion of the FST in many duplicated
byte[] fragments (in the NodeHash).

So ... could we somehow relax Lucene's Directory semantics to allow opening
an IndexInput on a still appending IndexOutput, since most filesystems are
fine with this?

Mike McCandless

http://blog.mikemccandless.com

Reply via email to