Re: [I] TokenStream.IncrementToken() is called after Dispose() is called [lucenenet]

via GitHub Fri, 29 Nov 2024 15:47:23 -0800


paulirwin commented on issue #271:
URL: https://github.com/apache/lucenenet/issues/271#issuecomment-2508737496

I spent quite a bit of time looking at this over the past few days, and I've
come to what I think is a surprising idea, or at least it would have surprised
past me when I started looking into this: I think we should remove IDisposable
from TokenStream and its descendants completely, in favor of ICloseable. This
is, I believe, the simplest solution, as well as the one that most closely
matches Lucene. Then we avoid the whole IDisposable problem completely.

As noted above, these token streams are intended to be reused after Close is
called. The fact that they are stored in an Analyzer via a "ReuseStrategy"
speaks to this. So we know that having ICloseable is the right design choice
here. This has the added benefit of more closely matching Lucene. I went
through and moved all reusable close logic to Close overrides, and that was
pretty easy (although time-consuming to go through ~200 implementations to
review them all).

I made Dispose call Close, because Close is supposed to be idempotent, so it
should be able to be called multiple times, i.e. if you call `.Close()` and
then `.Dispose()` that is allowed. I do not think this violates anything (a
concern expressed above), because any resources cleaned up by Close would thus
need to be cleaned up by Dispose if Close is not called, but the opposite may
not be true, so it's not violating any rules in two directions. This works, all
tests pass, but there's a concern I had about object lifetimes, and who should
be calling Dispose. That started down the rabbit hole that led to this proposal.

The problem with IDisposable TokenStreams is that you're expected to dispose
of IDisposables when you're done, and as noted above, TokenStreams are intended
to be reused by the Analyzer, per the ReuseStrategy. We cannot make
ReuseStrategy disposable, because it is supposed to use the singleton instances
`Analyzer.PER_FIELD_REUSE_STRATEGY` and `Analyzer.GLOBAL_REUSE_STRATEGY`.
There's also no real need, as those classes just encapsulate behavior and not
state. The state is stored in the Analyzer, in the `storedValue` property. Now,
we could make Analyzer disposable, make Analyzer's `DisposableThreadLocal<T>`
cascade the Dispose call to its state if its state is disposable, and create a
disposable wrapper around JCG.Dictionary for the PerFieldReuseStrategy (so that
it would thus dispose of all of its dictionary values), but I actually think
this might be unnecessary complexity, because then we would need to make sure
that Analyzers are disposed everywhere, etc.

I asked myself: why is this not a problem in Lucene? And after digging, and
asking ChatGPT, the answer is that Analyzers and TokenStreams are not supposed
to hold state that can't be garbage collected or at least released via Close
(in the case of TokenStreams). And if you do need one that has to be disposed,
then you can track the lifetime of that outside of the analyzer, and dispose of
it yourself, because users are usually instantiating Analyzers directly
anyways. There's nothing stopping someone from adding IDisposable to their
custom Tokenizer or even Analyzer... but we don't have to make Lucene.NET
unnecessarily complex and divergent from Lucene in order to accommodate this.

(Aside: there was a thought above that in Java, AutoCloseable means that
Close is sometimes called implicitly, which implies _automatically_, and this
is not exactly true. It is called implicitly when the AutoCloseable is used in
a try-with-resources statement (akin to a using statement in C#), but it is not
called _automatically_ as part of resource cleanup.)

You also might be asking: what about the IDisposable TextReader in
Tokenizer? Well, that is already going to be disposed in Close as a result of
this change, but also it is passed into the Tokenizer via SetReader, so really
it's the responsibility of the caller to dispose of it if Close is not called.
The only case where this realistically could be a problem (since most cases use
StringReader which is safely GC-able without being disposed) is if you use a
TextReader Field value to some non-GC-able implementation of TextReader, in
which case you're probably already safely disposing of that in a using
block/statement anyways.

There is very little good reason within the Lucene.NET code to have
Disposable, as there are only _two_ cases that I found in the entire codebase,
after reviewing all TokenStream subclasses, where the cleanup logic is not
reusable (and thus would not already have its Dispose logic moved into Close):
`PrefixTreeStrategy.CellTokenStream` and
`MemoryIndex.TokenStreamAnonymousClass<T>`.

`PrefixTreeStrategy.CellTokenStream` has an `IEnumerator<Cell>` field that
is passed into its constructor, which is why it is not reusable (the enumerator
is not reassigned/reinitialized inside this class). But it is just enumerating
over an in-memory list of cells, so there's no concern about leaking this (it
would just be garbage collected). Slightly sloppy code, but nothing to lose
sleep over. We could easily dispose of the enumerator in a Close method, even
though that's not reusable, as it's only used by the encapsulating Field, and
is basically a private implementation detail. We could make a
LUCENENET-specific change to have this type take an IEnumerable instead in its
ctor, dispose of the enumerator in Close, and re-initialize it if it's
attempted to be reused. Based on the comment at the call site, this seems like
it might have been what was intended. Or, we could take [the latest
code](https://github.com/apache/lucene/blob/06a320a53e5ed69c382ad31498738b7b13ccfc45/lucene/

spatial-extras/src/java/org/apache/lucene/spatial/prefix/PrefixTreeStrategy.java#L142C3-L152C4)
which avoids this problem entirely (and does not need to be disposed). I think
I'd prefer the latter, if possible.

`MemoryIndex.TokenStreamAnonymousClass<T>` likewise holds an
`IEnumerator<T>` passed into its ctor, and likewise is created from an
ICollection that likely can be safely GC'ed if not disposed (although there are
no usages to its caller `KeywordTokenStream<T>(ICollection<T> keywords)`, even
in tests, so it's hard to tell). Upstream, latest Lucene has [a
comment](https://github.com/apache/lucene/blob/06a320a53e5ed69c382ad31498738b7b13ccfc45/lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java#L554)
indicating that this method possibly should be deprecated. But regardless,
it's apparently safe if this is not reused.

So out of the O(200) derived types from TokenStream, only _two_ have
non-reusable state that needs to be cleaned up, and even then, both are private
implementation details, and it appears that the only valid intended of these is
with in-memory data where it's safe to not clean up the enumerator in Dispose,
and one of them even has updated code that avoids the problem. So we'd be going
through the hassle of making TokenStream and its descendants IDisposable for no
discernible benefit, when users that need their custom implementations to be
disposable can handle that outside of our library code with a custom or
anonymous Analyzer.

So if we remove IDisposable from TokenStream in favor of ICloseable, it
simplifies this dramatically, and avoids a lot of the concerns expressed in the
comments above. We can still have `.Close()` called in try/finally blocks where
it makes sense to do so, to make sure that's called if an exception occurs, but
this way we don't have to worry about someone accidentally disposing of a
TokenStream via a using statement or otherwise when it's possible it might be
reused. And we get the added benefit of ensuring that these types are reusable
(whereas they are not always today).

I'm going to prototype this approach to determine if it's viable, but I have
a hard time believing it wouldn't be at this point.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [I] TokenStream.IncrementToken() is called after Dispose() is called [lucenenet]

Reply via email to