mikemccand commented on PR #15616:
URL: https://github.com/apache/lucene/pull/15616#issuecomment-3805120544

   > > I'm just noticing we have two PRs addressing the same issue, this one 
and #15610.
   > 
   > @stefanvodita I think I am fine with either solution, this one preserves 
the initial test state, so should be fine? Except for changing the backward 
codecs, the implications of which I am not aware.
   
   +1 for this solution, even though having to alter the RW Codec impersonators 
is generally a red flag (explanation below).
   
   This is a fun change ;)  @stefanvodita DM'd me this AM with a "quick 
question about a Lucene PR", heh... it was not quick!  Some long (sorry) 
context/reasoning:
   
   Lucene intentionally doesn't offer users the ability to write indices in 
older formats/Codecs (to do so would reign hell on earth for Lucene devs... we 
would have to divert 100% of our efforts to back compat ... back compat on the 
read side alone is hard enough!).  Each Lucene version can only write in the 
latest format, but can read any format since (N-1).0.0 was released (e.g. all 
10.x released versions can read indices written by any Lucene release >= 
9.0.0).  If users really need to create older format indices they should use 
older Lucene releases.
   
   But we make an exception to this for our unit tests: tests have test-only 
APIs ("Codec impersonators") to write in the older formats so we can properly 
test/fix back-compat bugs in current Lucene.  This has been an awesome facility 
over time, squashing our back-compat bugs -- when there is a bug, we write a 
test that creates the index in that specific older format, runs test that shows 
the bug (to ensure we can repro), then we fix the bug and verify the test now 
passes.
   
   When a format within Codec becomes old (because we just created a 
new/better/faster/cooler writer for it), we split it apart.  Its reader goes 
into `backwards-codecs` module, so users can always read their (N-1).0.0+ 
created indices.  But its writer goes into the "test support only", i.e. 
`lucene/backwards-codecs/src/test/...`.  In theory, that writer should be an 
exact copy from core, i.e. `git mv` plus changing the package name.
   
   The impersonator we are fixing in this PR 
(`lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene84/Lucene84RWCodec.java`)
 is an example: it's test-only code for writing very very old indices, and then 
the current `testMergeStability` test case writes this old index, down to one 
segment, then opens it (with current Lucene Codec), merges, and validates a 
unary merge (one segment to one segment) doesn't alter the file sizes.  I think 
this was for an insidious bulk merge (potent stored fields merge optimization) 
bug that somehow inflated things on merge, sort of like the innocent `cp` Linux 
command accidentally expanding all the sparse 0s on copy.
   
   Having to change the impersonator source code (this PR) is generally a red 
flag -- it should ideally be read-only because it's trying to mimic precisely 
what the old Lucene release's writer was doing.  But as we see here, we can't 
always achieve this (never changing the impersonators).
   
   I tried to dig into origin story specifically for `Lucene84RWCodec.java`, 
even [using Claude Opus 4.5 for 
help](https://claude.ai/share/48fd5d10-8af4-4f5e-aa7a-9bedd9ed27c4) (it was fun 
watching it struggle/iterate!  makes me not feel so bad when I do the same, 
with git, ha.  but i don't understand why it has so much trouble accessing 
GitHub (`Network restrictions prevent cloning`).  is its virtual env somehow 
running client-side in my browser!?).  Anyway, [this 
commit](https://github.com/apache/lucene/commit/54c5dd7d6d8b486df9132e637a6d7c5871ac4664)
 was its original creation, but it doesn't yet have `compoundFormat()`.
   
   What I'm trying to understand is whether this has always been a bug with 
8.4.0+ Lucene releases?  If Lucene users tried to change compound file 
`true|false`, would it not take?  Why didn't we hear from users about this bug 
if so...?  Or, maybe back then, you only used `MergePolicy` and 
`IndexWriterConfig` to alter compound file format, and those worked?  And it 
was only later/recently that `CompoundFormat` has this setter?
   
   OK, I see, @shubhamsrkdev [already explained/linked in the other possible PR 
to fix this issue](https://github.com/apache/lucene/pull/15610).  It was the 
recent https://github.com/apache/lucene/pull/15295 (wow, big/complex change, 
thanks @shubhamsrkdev!) that moved CFS configuration into the `CompoundFormat`, 
making it mutable.  OK, phew -- prior to that, `CompoundFormat` had no mutable 
state, so Codec (including the writer impersonators) could freely/harmlessly 
return a new `CompoundFormat` instance each time.
   
   So this likely means we'll need to find all old Codec impersonators (the 
writers) and ensure they return the same `CompoundFormat` each time 
`compoundFormat()` is called.  And it is really quite hard to reason about -- 
#15295 sort of retroactively changed how all of our backwards impersonators 
chose to write CFS or not, making them less accurately impersonating the true 
(released long ago) writers?
   
   We generally shy away from having Codec writer components have mutable 
(setters) state, and favor passing in such configuration only during 
construction (I think?) so you get a stateless writer.  Do we have other places 
where we've added setters to Codec writers?  It leads to scary 
retroactively-change-all-impersonating-writers type situations like this...
   
   Separately: are we sure we still need these specific back-compat tests!?  On 
major release, we are supposed to remove any test cases testing now unsupported 
index formats -- Lucene 8.4 indices are long ago not readable (as of 10.0.0+) 
-- why do we still have them?  Let's not make the read side back compat burden 
even harder by carrying forward long-obsolete tests!!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to