NightOwl888 commented on issue #271:
URL: https://github.com/apache/lucenenet/issues/271#issuecomment-2509096553

   Thanks so much for the detailed analysis, Paul.
   
   So, what I gather is that you are saying a disposable dependency would be injected like this?
   
   ```c#
   using var someDisposable = new SomeDisposable();

   // matchVersion and stopwords are assumed to be locals in the enclosing
   // scope; inside NewAnonymous there are no m_matchVersion/m_stopwords
   // analyzer fields to reference.
   Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
   {
       var src = new StandardTokenizer(matchVersion, reader);
       src.MaxTokenLength = 150;
       TokenStream tok = new StandardFilter(matchVersion, src);
       tok = new LowerCaseFilter(matchVersion, tok);
       tok = new StopFilter(matchVersion, tok, stopwords);
       // someDisposable is captured by the closure: the analyzer uses it,
       // but the enclosing scope still owns it and disposes it.
       tok = new MyFilter(someDisposable, tok);
       return new TokenStreamComponents(src, tok);
   });
   ```
   
   Alternatively, the user would need to implement a whole Analyzer subclass, 
but they could pass the disposable instance into the constructor to accomplish 
the same thing. I have to admit, doing it that way hadn't crossed my mind, but 
I am not sure it covers everything we need to fix. It may address all usability 
issues, though.
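
   For illustration, the ownership rule behind that pattern can be reduced to plain C# (the `SomeDisposable` and `MyFilter` names are just placeholders from the snippet above, not Lucene.NET types): the consumer borrows the disposable, while the creating scope keeps ownership and disposes it.

   ```c#
   using System;

   public sealed class SomeDisposable : IDisposable
   {
       public bool IsDisposed { get; private set; }
       public void Dispose() => IsDisposed = true;
   }

   public sealed class MyFilter
   {
       private readonly SomeDisposable borrowed;

       // The caller owns the disposable; this class only borrows it.
       public MyFilter(SomeDisposable borrowed) => this.borrowed = borrowed;

       public bool DependencyAlive => !borrowed.IsDisposed;
   }

   public static class Program
   {
       public static void Main()
       {
           var someDisposable = new SomeDisposable();
           using (someDisposable)
           {
               var filter = new MyFilter(someDisposable);
               Console.WriteLine(filter.DependencyAlive); // alive while the using block is active
           }
           Console.WriteLine(someDisposable.IsDisposed); // the creating scope disposed it
       }
   }
   ```

   The same lifetime rule applies whether the disposable is injected through a closure (as above) or through an `Analyzer` subclass constructor.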
   
   The main motivation for this issue is to fix the [`BaseTokenStreamTestCase` 
class](https://github.com/apache/lucenenet/blob/abe75de14adb010fd4ac2b614cf5af2b8157e22b/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L389-L400).
 We have added hacks to prevent cascade failures from occurring with the 
original design. But this has bled over into other areas, such as the two 
`TestRandomChains` test failures. IIRC, both of these things require the state 
of `ILLEGAL_STATE_READER` to be carefully managed.
   
   There were 3 main issues we ran into with `BaseTokenStreamTestCase`:
   
   1. If an assert failed partway through, it could leave the `TokenStream` in an inconsistent state because it would never complete the [TokenStream contract](https://lucenenet.apache.org/docs/4.8.0-beta00017/api/core/Lucene.Net.Analysis.TokenStream.html). This would in turn cause all of the remaining tests in the same test class to fail, because the `TokenStream` is in an invalid state each time it is reused.
   2. If a `TokenStream` class owns a file handle and we skip calling `Dispose()` on it, other tests that use the same file would get a "File in use by another process" exception. This is probably not an issue in our tests since we embedded the data files as resources into the assemblies (which means they open using a managed stream instance), but I see no reason to impose a limitation of strictly using embedded resources on our users.
   3. If a `TokenStream` class owns a native resource handle (as was once the case when we were using icu-dotnet), there is also no way to clean it up using `Close()`, since the cleanup needs to happen after all of the reuse is done.
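
   Point 2 is easy to reproduce in isolation. Here is a minimal sketch (plain C#, no Lucene.NET types) of how a skipped `Dispose()` keeps a file locked for the next consumer:

   ```c#
   using System;
   using System.IO;

   public static class FileLockDemo
   {
       public static void Main()
       {
           string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
           File.WriteAllText(path, "stopwords");

           // Simulate a component whose Dispose() was skipped: the stream
           // stays open and holds an exclusive lock on the file.
           var leaked = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None);
           try
           {
               // The "next test" tries to open the same file and fails.
               using var second = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None);
               Console.WriteLine("opened (no lock held)");
           }
           catch (IOException)
           {
               Console.WriteLine("file in use by another consumer");
           }
           finally
           {
               leaked.Dispose(); // releasing the handle unblocks later opens
               File.Delete(path);
           }
       }
   }
   ```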
   
   It was the 3rd problem that made me open this issue, as using icu-dotnet 
with Lucene.NET was completely impossible without having a `Dispose()` method 
that was only called once at the end of the process. The above solution would 
have been very ugly to hand our users - imagine having to [instruct users to 
`Init()` and 
`Cleanup()`](https://github.com/sillsdev/icu-dotnet?tab=readme-ov-file#usage) 
in every piece of code that uses a particular analyzer implementation (at one 
point, those were required). And what if a future implementation we own 
requires us to call something disposable?
   
   My thought was to separate the concept of `Close()` from `Dispose()` so the test framework could clean up a reusable `TokenStream` instance without modifying the `ILLEGAL_STATE_READER`, so that a single failure doesn't cause all of the other tests in the class to fail along with it. But IIRC, there has never been an attempt to make a failure happen in Lucene to see whether it exhibits the same cascade failures, or whether there is some other hook that is supposed to clean up this state. Those cascade failures made finding the test that was actually failing very hard. If Lucene does not have cascade failures, I am curious as to why not.
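
   To make the distinction concrete, here is a minimal sketch of the split (hypothetical plain C#, not the Lucene.NET API): `Close()` is the per-use reset that keeps the instance reusable, while `Dispose()` is the one-time end-of-life cleanup.

   ```c#
   using System;

   public sealed class ReusableComponent : IDisposable
   {
       public bool InUse { get; private set; }
       public bool IsDisposed { get; private set; }

       public void Consume()
       {
           if (IsDisposed) throw new ObjectDisposedException(nameof(ReusableComponent));
           InUse = true;
       }

       // Per-use cleanup: safe to call after a failed run, and it leaves
       // the instance in a reusable state.
       public void Close() => InUse = false;

       // End-of-life cleanup: called exactly once after all reuse is done,
       // e.g. to release a native handle such as one owned by icu-dotnet.
       public void Dispose() => IsDisposed = true;
   }
   ```

   With that split, the test framework could call `Close()` after every use, including failed ones, and call `Dispose()` only once at the end.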
   
   > Of course, we have changed a lot of things since those hacks were put in 
place, such as fixing the error handling to work the same way it did in Lucene 
and switching from external resource files to embedded resources.
   
   A good proof of concept would be to ensure the `TestRandomChains` test failures can be patched with the proposed fix. I would be interested to know:
   
   1. Can we make those pass without having disposable analysis components?
   2. Can we make the tests fail independently without getting cascade 
failures? Even on `TextReader` implementations that own file handles?
   
   It is that last condition that made me think there was no way out except for 
having disposable token stream components, since the original configuration 
would always leave the file handle open upon test failure.
   

