Re: [PR] Fix Lucene84RWCodec to cache CompoundFormat instance [lucene]

via GitHub Tue, 27 Jan 2026 05:05:25 -0800


mikemccand commented on PR #15616:
URL: https://github.com/apache/lucene/pull/15616#issuecomment-3805120544

> > I'm just noticing we have two PRs addressing the same issue, this one
and #15610.
>
> @stefanvodita I think I am fine with either solution, this one preserves
the initial test state, so should be fine? Except for changing the backward
codecs, the implications of which I am not aware.

+1 for this solution, even though having to alter the RW Codec impersonators
is generally a red flag (explanation below).

This is a fun change ;) @stefanvodita DM'd me this AM with a "quick
question about a Lucene PR", heh... it was not quick! Some long (sorry)
context/reasoning:

Lucene intentionally doesn't offer users the ability to write indices in
older formats/Codecs (to do so would reign hell on earth for Lucene devs... we
would have to divert 100% of our efforts to back compat ... back compat on the
read side alone is hard enough!). Each Lucene version can only write in the
latest format, but can read any format since (N-1).0.0 was released (e.g. all
10.x released versions can read indices written by any Lucene release >=
9.0.0). If users really need to create older format indices they should use
older Lucene releases.

But we make an exception to this for our unit tests: tests have test-only
APIs ("Codec impersonators") to write in the older formats so we can properly
test/fix back-compat bugs in current Lucene. This has been an awesome facility
over time, squashing our back-compat bugs -- when there is a bug, we write a
test that creates the index in that specific older format, runs test that shows
the bug (to ensure we can repro), then we fix the bug and verify the test now
passes.

When a format within Codec becomes old (because we just created a
new/better/faster/cooler writer for it), we split it apart. Its reader goes
into `backwards-codecs` module, so users can always read their (N-1).0.0+
created indices. But its writer goes into the "test support only", i.e.
`lucene/backwards-codecs/src/test/...`. In theory, that writer should be an
exact copy from core, i.e. `git mv` plus changing the package name.

The impersonator we are fixing in this PR
(`lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene84/Lucene84RWCodec.java`)
is an example: it's test-only code for writing very very old indices, and then
the current `testMergeStability` test case writes this old index, down to one
segment, then opens it (with current Lucene Codec), merges, and validates a
unary merge (one segment to one segment) doesn't alter the file sizes. I think
this was for an insidious bulk merge (potent stored fields merge optimization)
bug that somehow inflated things on merge, sort of like the innocent `cp` Linux
command accidentally expanding all the sparse 0s on copy.

Having to change the impersonator source code (this PR) is generally a red
flag -- it should ideally be read-only because it's trying to mimic precisely
what the old Lucene release's writer was doing. But as we see here, we can't
always achieve this (never changing the impersonators).

I tried to dig into origin story specifically for `Lucene84RWCodec.java`,
even [using Claude Opus 4.5 for
help](https://claude.ai/share/48fd5d10-8af4-4f5e-aa7a-9bedd9ed27c4) (it was fun
watching it struggle/iterate! makes me not feel so bad when I do the same,
with git, ha. but i don't understand why it has so much trouble accessing
GitHub (`Network restrictions prevent cloning`). is its virtual env somehow
running client-side in my browser!?). Anyway, [this
commit](https://github.com/apache/lucene/commit/54c5dd7d6d8b486df9132e637a6d7c5871ac4664)
was its original creation, but it doesn't yet have `compoundFormat()`.

What I'm trying to understand is whether this has always been a bug with
8.4.0+ Lucene releases? If Lucene users tried to change compound file
`true|false`, would it not take? Why didn't we hear from users about this bug
if so...? Or, maybe back then, you only used `MergePolicy` and
`IndexWriterConfig` to alter compound file format, and those worked? And it
was only later/recently that `CompoundFormat` has this setter?

OK, I see, @shubhamsrkdev [already explained/linked in the other possible PR
to fix this issue](https://github.com/apache/lucene/pull/15610). It was the
recent https://github.com/apache/lucene/pull/15295 (wow, big/complex change,
thanks @shubhamsrkdev!) that moved CFS configuration into the `CompoundFormat`,
making it mutable. OK, phew -- prior to that, `CompoundFormat` had no mutable
state, so Codec (including the writer impersonators) could freely/harmlessly
return a new `CompoundFormat` instance each time.

So this likely means we'll need to find all old Codec impersonators (the
writers) and ensure they return the same `CompoundFormat` each time
`compoundFormat()` is called. And it is really quite hard to reason about --
#15295 sort of retroactively changed how all of our backwards impersonators
chose to write CFS or not, making them less accurately impersonating the true
(released long ago) writers?

We generally shy away from having Codec writer components have mutable
(setters) state, and favor passing in such configuration only during
construction (I think?) so you get a stateless writer. Do we have other places
where we've added setters to Codec writers? It leads to scary
retroactively-change-all-impersonating-writers type situations like this...

Separately: are we sure we still need these specific back-compat tests!? On
major release, we are supposed to remove any test cases testing now unsupported
index formats -- Lucene 8.4 indices are long ago not readable (as of 10.0.0+)
-- why do we still have them? Let's not make the read side back compat burden
even harder by carrying forward long-obsolete tests!!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix Lucene84RWCodec to cache CompoundFormat instance [lucene]

Reply via email to