NightOwl888 commented on issue #271: URL: https://github.com/apache/lucenenet/issues/271#issuecomment-2509096553
Thanks so much for the detailed analysis, Paul. So, what I gather, you are saying a disposable class would be injected like this? ```c# using var someDisposable = new SomeDisposable(); Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) => { var src = new StandardTokenizer(m_matchVersion, reader); src.MaxTokenLength = 150; TokenStream tok = new StandardFilter(m_matchVersion, src); tok = new LowerCaseFilter(m_matchVersion, tok); tok = new StopFilter(m_matchVersion, tok, m_stopwords); tok = new MyFilter(someDisposable, tok); return new TokenStreamComponents(src, tok); }); ``` Alternatively, the user would need to implement a whole Analyzer subclass, but they could pass the disposable instance into the constructor to accomplish the same thing. I have to admit, doing it that way hadn't crossed my mind, but I am not sure it covers everything we need to fix. It may address all usability issues, though. The main motivation for this issue is to fix the [`BaseTokenStreamTestCase` class](https://github.com/apache/lucenenet/blob/abe75de14adb010fd4ac2b614cf5af2b8157e22b/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L389-L400). We have added hacks to prevent cascade failures from occurring with the original design. But this has bled over into other areas, such as the two `TestRandomChains` test failures. IIRC, both of these things require the state of `ILLEGAL_STATE_READER` to be carefully managed. There were 3 main issues we ran into with `BaseTokenStreamTestCase`: 1. If an assert happened, it could leave the `TokenStream` in an inconsistent state because it would fail to fulfill the [TokenStream contract](https://lucenenet.apache.org/docs/4.8.0-beta00017/api/core/Lucene.Net.Analysis.TokenStream.html). This would in turn cause all of the remaining tests in the same test class to fail because the `TokenStream` is in an invalid state each time it is reused. 2. If a `TokenStream` class owns a file handle and we skip the call `Dispose()` on it, other tests that use the same file would get a "File in use by another process" exception. This is probably not an issue in our tests since we embedded the data files as resources into the assemblies (which means they open using a managed stream instance), but I see no reason to impose a limitation of strictly using embedded resources for our users. 3. If a `TokenStream` class owns a native resource handle (as was once the case when we were using icu-dotnet), there is also no way to clean it up using Close() since it needs to happen after all of the reuse is done. It was the 3rd problem that made me open this issue, as using icu-dotnet with Lucene.NET was completely impossible without having a `Dispose()` method that was only called once at the end of the process. The above solution would have been very ugly to hand our users - imagine having to [instruct users to `Init()` and `Cleanup()`](https://github.com/sillsdev/icu-dotnet?tab=readme-ov-file#usage) in every piece of code that uses a particular analyzer implementation (at one point, those were required). And what if a future implementation we own requires us to call something disposable? My thought was to separate the concepts of `Close()` from `Dispsose()` so the test framework could clean up a reusable `TokenStream` instance without modifying the `ILLEGAL_STATE_READER` so it doesn't cause all of the other tests in the class to fail along with it. But, IIRC there has never been an attempt to make a failure happen in Lucene to see if it had the same cascade failures or if there is some other hook that is supposed to clean up this state. Having those cascade failures made finding the test that was actually failing very hard. If Lucene does not have cascade failures, I am curious as to why they don't. > Of course, we have changed a lot of things since those hacks were put in place, such as fixing the error handling to work the same way it did in Lucene and switching from external resource files to embedded resources. A good proof of concept would be to ensure the `TestRandomChains` test failures can be patched with the proposed fix. I would be interested to know 1. Can we make those pass without having disposable analysis components? 2. Can we make the tests fail independently without getting cascade failures? Even on `TextReader` implementations that own file handles? It is that last condition that made me think there was no way out except for having disposable token stream components, since the original configuration would always leave the file handle open upon test failure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org