Hi Anh Dũng Bùi,

Thank you for tackling these, and for being so patient and persistent! Sorry for the delay. I will try to review them soon.

The off-heap (streaming?) building of FSTs is really a massive improvement to Lucene, inspired by Tantivy's FST implementation: https://blog.burntsushi.net/transducers/
Read-time for Lucene90BlockTreePostingsFormat was already off-heap? And your PR changes write-time to do so as well? This will reduce RAM pressure during indexing, which is great. And some Lucene usages generate incredibly large FSTs (I'm looking at you, HathiTrust!).

I don't think we need to explicitly measure any performance impact before merging, but let's watch the nightly benchy to see if there is any measurable impact.

And, yes, Lucene90BlockTreePostingsFormat is the default. You can find the default codec via Codec.getDefault() and then trace downwards to all its sources.

Maybe building the synonyms FST (SynonymMap.Builder) would be a good place for off-heap writing too? And this exciting PR <https://github.com/apache/lucene/pull/12688> (still a work in progress) would likely benefit strongly from streaming FST building: its FSTs will be much larger than the Lucene90BlockTree ones, since it stores all terms (not just the sampled prefix/index) in a single FST for the segment.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Feb 1, 2024 at 10:40 PM Anh Dũng Bùi <dungba...@gmail.com> wrote:

> Hi Lucene devs!
>
> I have 2 PRs to optimize Lucene PostingsFormat
> (Lucene90BlockTreePostingsFormat and FSTPostingsFormat) by utilizing a new
> feature to stream the FST to IndexOutput directly, bypassing the on-heap
> writing:
> - https://github.com/apache/lucene/pull/12980
> - https://github.com/apache/lucene/pull/12985
>
> It would be great if someone can help reviewing. I also have some general
> questions:
> - How do I measure the memory improvement impact in Lucene?
> - Is Lucene90BlockTreePostingsFormat the main index format used in Lucene?
> If not, what is the main format?
> - Are there other places worth using the new streaming FST feature?
>
> Thank you!
> Anh Dung Bui
>