paulirwin commented on issue #271: URL: https://github.com/apache/lucenenet/issues/271#issuecomment-2508737496
I spent quite a bit of time looking at this over the past few days, and I've come to what I think is a surprising idea, or at least it would have surprised past me when I started looking into this: I think we should remove IDisposable from TokenStream and its descendants completely, in favor of ICloseable. This is, I believe, the simplest solution, as well as the one that most closely matches Lucene. Then we avoid the whole IDisposable problem completely. As noted above, these token streams are intended to be reused after Close is called. The fact that they are stored in an Analyzer via a "ReuseStrategy" speaks to this. So we know that having ICloseable is the right design choice here. This has the added benefit of more closely matching Lucene. I went through and moved all reusable close logic to Close overrides, and that was pretty easy (although time-consuming to go through ~200 implementations to review them all). I made Dispose call Close, because Close is supposed to be idempotent, so it should be able to be called multiple times, i.e. if you call `.Close()` and then `.Dispose()` that is allowed. I do not think this violates anything (a concern expressed above), because any resources cleaned up by Close would thus need to be cleaned up by Dispose if Close is not called, but the opposite may not be true, so it's not violating any rules in two directions. This works, all tests pass, but there's a concern I had about object lifetimes, and who should be calling Dispose. That started down the rabbit hole that led to this proposal. The problem with IDisposable TokenStreams is that you're expected to dispose of IDisposables when you're done, and as noted above, TokenStreams are intended to be reused by the Analyzer, per the ReuseStrategy. We cannot make ReuseStrategy disposable, because it is supposed to use the singleton instances `Analyzer.PER_FIELD_REUSE_STRATEGY` and `Analyzer.GLOBAL_REUSE_STRATEGY`. There's also no real need, as those classes just encapsulate behavior and not state. The state is stored in the Analyzer, in the `storedValue` property. Now, we could make Analyzer disposable, make Analyzer's `DisposableThreadLocal<T>` cascade the Dispose call to its state if its state is disposable, and create a disposable wrapper around JCG.Dictionary for the PerFieldReuseStrategy (so that it would thus dispose of all of its dictionary values), but I actually think this might be unnecessary complexity, because then we would need to make sure that Analyzers are disposed everywhere, etc. I asked myself: why is this not a problem in Lucene? And after digging, and asking ChatGPT, the answer is that Analyzers and TokenStreams are not supposed to hold state that can't be garbage collected or at least released via Close (in the case of TokenStreams). And if you do need one that has to be disposed, then you can track the lifetime of that outside of the analyzer, and dispose of it yourself, because users are usually instantiating Analyzers directly anyways. There's nothing stopping someone from adding IDisposable to their custom Tokenizer or even Analyzer... but we don't have to make Lucene.NET unnecessarily complex and divergent from Lucene in order to accommodate this. (Aside: there was a thought above that in Java, AutoCloseable means that Close is sometimes called implicitly, which implies _automatically_, and this is not exactly true. It is called implicitly when the AutoCloseable is used in a try-with-resources statement (akin to a using statement in C#), but it is not called _automatically_ as part of resource cleanup.) You also might be asking: what about the IDisposable TextReader in Tokenizer? Well, that is already going to be disposed in Close as a result of this change, but also it is passed into the Tokenizer via SetReader, so really it's the responsibility of the caller to dispose of it if Close is not called. The only case where this realistically could be a problem (since most cases use StringReader which is safely GC-able without being disposed) is if you use a TextReader Field value to some non-GC-able implementation of TextReader, in which case you're probably already safely disposing of that in a using block/statement anyways. There is very little good reason within the Lucene.NET code to have Disposable, as there are only _two_ cases that I found in the entire codebase, after reviewing all TokenStream subclasses, where the cleanup logic is not reusable (and thus would not already have its Dispose logic moved into Close): `PrefixTreeStrategy.CellTokenStream` and `MemoryIndex.TokenStreamAnonymousClass<T>`. `PrefixTreeStrategy.CellTokenStream` has an `IEnumerator<Cell>` field that is passed into its constructor, which is why it is not reusable (the enumerator is not reassigned/reinitialized inside this class). But it is just enumerating over an in-memory list of cells, so there's no concern about leaking this (it would just be garbage collected). Slightly sloppy code, but nothing to lose sleep over. We could easily dispose of the enumerator in a Close method, even though that's not reusable, as it's only used by the encapsulating Field, and is basically a private implementation detail. We could make a LUCENENET-specific change to have this type take an IEnumerable instead in its ctor, dispose of the enumerator in Close, and re-initialize it if it's attempted to be reused. Based on the comment at the call site, this seems like it might have been what was intended. Or, we could take [the latest code](https://github.com/apache/lucene/blob/06a320a53e5ed69c382ad31498738b7b13ccfc45/lucene/ spatial-extras/src/java/org/apache/lucene/spatial/prefix/PrefixTreeStrategy.java#L142C3-L152C4) which avoids this problem entirely (and does not need to be disposed). I think I'd prefer the latter, if possible. `MemoryIndex.TokenStreamAnonymousClass<T>` likewise holds an `IEnumerator<T>` passed into its ctor, and likewise is created from an ICollection that likely can be safely GC'ed if not disposed (although there are no usages to its caller `KeywordTokenStream<T>(ICollection<T> keywords)`, even in tests, so it's hard to tell). Upstream, latest Lucene has [a comment](https://github.com/apache/lucene/blob/06a320a53e5ed69c382ad31498738b7b13ccfc45/lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java#L554) indicating that this method possibly should be deprecated. But regardless, it's apparently safe if this is not reused. So out of the O(200) derived types from TokenStream, only _two_ have non-reusable state that needs to be cleaned up, and even then, both are private implementation details, and it appears that the only valid intended of these is with in-memory data where it's safe to not clean up the enumerator in Dispose, and one of them even has updated code that avoids the problem. So we'd be going through the hassle of making TokenStream and its descendants IDisposable for no discernible benefit, when users that need their custom implementations to be disposable can handle that outside of our library code with a custom or anonymous Analyzer. So if we remove IDisposable from TokenStream in favor of ICloseable, it simplifies this dramatically, and avoids a lot of the concerns expressed in the comments above. We can still have `.Close()` called in try/finally blocks where it makes sense to do so, to make sure that's called if an exception occurs, but this way we don't have to worry about someone accidentally disposing of a TokenStream via a using statement or otherwise when it's possible it might be reused. And we get the added benefit of ensuring that these types are reusable (whereas they are not always today). I'm going to prototype this approach to determine if it's viable, but I have a hard time believing it wouldn't be at this point. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org