Hi Anh Dũng Bùi,

Thank you for tackling these, and for being so patient and persistent!
Sorry for the delay.  I will try to review them soon.  The off-heap
(streaming) building of FSTs is really a massive improvement to Lucene,
inspired by Tantivy's FST implementation:
https://blog.burntsushi.net/transducers/

If I understand correctly, read-time for Lucene90BlockTreePostingsFormat
was already off-heap, and your PR makes write-time off-heap as well?  This
will reduce RAM pressure during indexing, which is great.  And some Lucene
usages generate incredibly large FSTs (I'm looking at you, HathiTrust!).  I
don't think we need to explicitly measure the performance impact before
merging, but let's watch the nightly benchmarks afterwards to see whether
there is any measurable change.
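
To make the RAM point concrete, here is a toy sketch of the difference
between buffering everything on heap and streaming each piece out as soon
as it is ready.  It uses only java.io; the class and method names
(StreamingSketch, CountingSink, buildOnHeap, buildStreaming) are made up
for illustration and are not Lucene's FSTCompiler or IndexOutput APIs, and
the fixed-size "nodes" are an assumption to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Toy illustration only: these are NOT Lucene's FST or IndexOutput APIs. */
class StreamingSketch {

  /** Hypothetical stand-in for an on-disk output; it only counts bytes. */
  static final class CountingSink extends OutputStream {
    long written = 0;

    @Override
    public void write(int b) {
      written++;
    }

    @Override
    public void write(byte[] buf, int off, int len) {
      written += len;
    }
  }

  /**
   * On-heap style: every node is buffered in memory and copied to the sink
   * only at the end. Returns the peak number of bytes held on heap.
   */
  static long buildOnHeap(int numNodes, int nodeSize, OutputStream sink)
      throws IOException {
    ByteArrayOutputStream heap = new ByteArrayOutputStream();
    byte[] node = new byte[nodeSize];
    for (int i = 0; i < numNodes; i++) {
      heap.write(node, 0, nodeSize);
    }
    long peak = heap.size();
    heap.writeTo(sink);
    return peak;
  }

  /**
   * Streaming style: each node goes to the sink as soon as it is complete,
   * so only one node is held at a time. Returns the peak bytes held on heap.
   */
  static long buildStreaming(int numNodes, int nodeSize, OutputStream sink)
      throws IOException {
    byte[] node = new byte[nodeSize];
    for (int i = 0; i < numNodes; i++) {
      sink.write(node, 0, nodeSize);
    }
    return nodeSize;
  }

  public static void main(String[] args) throws IOException {
    CountingSink a = new CountingSink();
    CountingSink b = new CountingSink();
    long onHeapPeak = buildOnHeap(1000, 16, a);
    long streamingPeak = buildStreaming(1000, 16, b);
    System.out.println("on-heap peak bytes:   " + onHeapPeak);
    System.out.println("streaming peak bytes: " + streamingPeak);
    System.out.println("same bytes written:   " + (a.written == b.written));
  }
}
```

Both paths write the same bytes; only the amount held in memory while
building differs, and that gap grows with the size of the FST.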

And, yes, Lucene90BlockTreePostingsFormat is the default.  You can find
the default codec via Codec.getDefault() and then trace downwards through
its sources to the formats it uses.

Maybe building the synonyms FST (SynonymMap.Builder) would be a good place
for off-heap writing too?

And this exciting PR <https://github.com/apache/lucene/pull/12688> (still a
work in progress) would likely benefit strongly from streaming FST
building: its FSTs will be much larger than Lucene90BlockTree's, since it
stores all terms (not just the sampled prefix/index) in a single FST for
the segment.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 1, 2024 at 10:40 PM Anh Dũng Bùi <dungba...@gmail.com> wrote:

> Hi Lucene devs!
>
> I have 2 PRs to optimize Lucene PostingsFormat
> (Lucene90BlockTreePostingsFormat and FSTPostingsFormat) by utilizing a new
> feature to stream the FST to IndexOutput directly, bypassing the on-heap
> writing:
> - https://github.com/apache/lucene/pull/12980
> - https://github.com/apache/lucene/pull/12985
>
> It would be great if someone can help reviewing. I also have some general
> questions:
> - How do I measure the memory improvement impact in Lucene?
> - Is Lucene90BlockTreePostingsFormat the main index format used in Lucene?
> If not, what is the main format?
> - Are there other places worth using the new streaming FST feature?
>
> Thank you!
> Anh Dung Bui
>
