paulirwin commented on issue #271:
URL: https://github.com/apache/lucenenet/issues/271#issuecomment-2508737496

   I spent quite a bit of time looking at this over the past few days, and I've 
come to what I think is a surprising idea, or at least it would have surprised 
past me when I started looking into this: I think we should remove IDisposable 
from TokenStream and its descendants completely, in favor of ICloseable. This 
is, I believe, the simplest solution, as well as the one that most closely 
matches Lucene. Then we avoid the whole IDisposable problem completely.
   
   As noted above, these token streams are intended to be reused after Close is 
called. The fact that they are stored in an Analyzer via a "ReuseStrategy" 
speaks to this. So we know that having ICloseable is the right design choice 
here. This has the added benefit of more closely matching Lucene. I went 
through and moved all reusable close logic to Close overrides, and that was 
pretty easy (although time-consuming to go through ~200 implementations to 
review them all).
   
   I made Dispose call Close, because Close is supposed to be idempotent, so it 
should be able to be called multiple times, i.e. if you call `.Close()` and 
then `.Dispose()` that is allowed. I do not think this violates anything (a 
concern expressed above), because any resources cleaned up by Close would thus 
need to be cleaned up by Dispose if Close is not called, but the opposite may 
not be true, so it's not violating any rules in two directions. This works, all 
tests pass, but there's a concern I had about object lifetimes, and who should 
be calling Dispose. That started down the rabbit hole that led to this proposal.
   
   The problem with IDisposable TokenStreams is that you're expected to dispose 
of IDisposables when you're done, and as noted above, TokenStreams are intended 
to be reused by the Analyzer, per the ReuseStrategy. We cannot make 
ReuseStrategy disposable, because it is supposed to use the singleton instances 
`Analyzer.PER_FIELD_REUSE_STRATEGY` and `Analyzer.GLOBAL_REUSE_STRATEGY`. 
There's also no real need, as those classes just encapsulate behavior and not 
state. The state is stored in the Analyzer, in the `storedValue` property. Now, 
we could make Analyzer disposable, make Analyzer's `DisposableThreadLocal<T>` 
cascade the Dispose call to its state if its state is disposable, and create a 
disposable wrapper around JCG.Dictionary for the PerFieldReuseStrategy (so that 
it would thus dispose of all of its dictionary values), but I actually think 
this might be unnecessary complexity, because then we would need to make sure 
that Analyzers are disposed everywhere, etc.
   
   I asked myself: why is this not a problem in Lucene? And after digging, and 
asking ChatGPT, the answer is that Analyzers and TokenStreams are not supposed 
to hold state that can't be garbage collected or at least released via Close 
(in the case of TokenStreams). And if you do need one that has to be disposed, 
then you can track the lifetime of that outside of the analyzer, and dispose of 
it yourself, because users are usually instantiating Analyzers directly 
anyways. There's nothing stopping someone from adding IDisposable to their 
custom Tokenizer or even Analyzer... but we don't have to make Lucene.NET 
unnecessarily complex and divergent from Lucene in order to accommodate this.
   
   (Aside: there was a thought above that in Java, AutoCloseable means that 
Close is sometimes called implicitly, which implies _automatically_, and this 
is not exactly true. It is called implicitly when the AutoCloseable is used in 
a try-with-resources statement (akin to a using statement in C#), but it is not 
called _automatically_ as part of resource cleanup.)
   
   You also might be asking: what about the IDisposable TextReader in 
Tokenizer? Well, that is already going to be disposed in Close as a result of 
this change, but also it is passed into the Tokenizer via SetReader, so really 
it's the responsibility of the caller to dispose of it if Close is not called. 
The only case where this realistically could be a problem (since most cases use 
StringReader which is safely GC-able without being disposed) is if you use a 
TextReader Field value to some non-GC-able implementation of TextReader, in 
which case you're probably already safely disposing of that in a using 
block/statement anyways.
   
   There is very little good reason within the Lucene.NET code to have 
Disposable, as there are only _two_ cases that I found in the entire codebase, 
after reviewing all TokenStream subclasses, where the cleanup logic is not 
reusable (and thus would not already have its Dispose logic moved into Close): 
`PrefixTreeStrategy.CellTokenStream` and 
`MemoryIndex.TokenStreamAnonymousClass<T>`.
   
   `PrefixTreeStrategy.CellTokenStream` has an `IEnumerator<Cell>` field that 
is passed into its constructor, which is why it is not reusable (the enumerator 
is not reassigned/reinitialized inside this class). But it is just enumerating 
over an in-memory list of cells, so there's no concern about leaking this (it 
would just be garbage collected). Slightly sloppy code, but nothing to lose 
sleep over. We could easily dispose of the enumerator in a Close method, even 
though that's not reusable, as it's only used by the encapsulating Field, and 
is basically a private implementation detail. We could make a 
LUCENENET-specific change to have this type take an IEnumerable instead in its 
ctor, dispose of the enumerator in Close, and re-initialize it if it's 
attempted to be reused. Based on the comment at the call site, this seems like 
it might have been what was intended. Or, we could take [the latest 
code](https://github.com/apache/lucene/blob/06a320a53e5ed69c382ad31498738b7b13ccfc45/lucene/
 
spatial-extras/src/java/org/apache/lucene/spatial/prefix/PrefixTreeStrategy.java#L142C3-L152C4)
 which avoids this problem entirely (and does not need to be disposed). I think 
I'd prefer the latter, if possible.
   
   `MemoryIndex.TokenStreamAnonymousClass<T>` likewise holds an 
`IEnumerator<T>` passed into its ctor, and likewise is created from an 
ICollection that likely can be safely GC'ed if not disposed (although there are 
no usages to its caller `KeywordTokenStream<T>(ICollection<T> keywords)`, even 
in tests, so it's hard to tell). Upstream, latest Lucene has [a 
comment](https://github.com/apache/lucene/blob/06a320a53e5ed69c382ad31498738b7b13ccfc45/lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java#L554)
 indicating that this method possibly should be deprecated. But regardless, 
it's apparently safe if this is not reused.
   
   So out of the O(200) derived types from TokenStream, only _two_ have 
non-reusable state that needs to be cleaned up, and even then, both are private 
implementation details, and it appears that the only valid intended of these is 
with in-memory data where it's safe to not clean up the enumerator in Dispose, 
and one of them even has updated code that avoids the problem. So we'd be going 
through the hassle of making TokenStream and its descendants IDisposable for no 
discernible benefit, when users that need their custom implementations to be 
disposable can handle that outside of our library code with a custom or 
anonymous Analyzer.
   
   So if we remove IDisposable from TokenStream in favor of ICloseable, it 
simplifies this dramatically, and avoids a lot of the concerns expressed in the 
comments above. We can still have `.Close()` called in try/finally blocks where 
it makes sense to do so, to make sure that's called if an exception occurs, but 
this way we don't have to worry about someone accidentally disposing of a 
TokenStream via a using statement or otherwise when it's possible it might be 
reused. And we get the added benefit of ensuring that these types are reusable 
(whereas they are not always today).
   
   I'm going to prototype this approach to determine if it's viable, but I have 
a hard time believing it wouldn't be at this point. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to