NightOwl888 commented on issue #1027: URL: https://github.com/apache/lucenenet/issues/1027#issuecomment-2562788290
Looks like you missed `OfflineSorter`. The tests specifically failed when it was configured to use a BOM, although I didn't analyze it closely enough to find out why. No objections if you wish to investigate this, but it definitely makes a difference as far as the tests are concerned.

It has gone through several rounds of refactoring since then, but currently it has a [`DEFAULT_ENCODING`](https://github.com/apache/lucenenet/blob/85c01412946ed1e2632cd2dfae4c672efd38caba/src/Lucene.Net/Util/OfflineSorter.cs#L44-L48) field that we added to ensure the tests pass.

So, we have a couple of options:

1. Remove the `DEFAULT_ENCODING` field and replace it with `IOUtils.CHARSET_UTF_8`. Update the `OfflineSorter` documentation for `ByteSequencesReader` and `ByteSequencesWriter` to indicate that callers of the constructor overloads that accept `BinaryReader` and `BinaryWriter` should use `IOUtils.CHARSET_UTF_8`.
2. Initialize the `DEFAULT_ENCODING` field with the same instance as `IOUtils.CHARSET_UTF_8`.

Given that we added this field specifically because `OfflineSorter` requires there to be no BOM (which differs from the .NET default), this could go either way. `IOUtils.CHARSET_UTF_8` was only recently changed to remove the BOM, so using it wasn't an option when the `DEFAULT_ENCODING` field was added. If it had been, it would have been reused here and the field wouldn't have been added.

Side note: perhaps we should also rename `IOUtils.CHARSET_UTF_8`, because it is public and "CharSet" is Java nomenclature. `ENCODING_UTF8_NO_BOM` would be a better name.
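
For reference, here is a minimal sketch of the BOM difference mentioned above. It does not reproduce the `OfflineSorter` test failure, and it goes through `StreamWriter` rather than the `BinaryWriter` overloads. It only shows that the stock `Encoding.UTF8` instance emits a preamble while a `UTF8Encoding` constructed with `encoderShouldEmitUTF8Identifier: false` does not. `BomDemo`, `Utf8NoBom`, and `Encode` are hypothetical names used for illustration, not Lucene.NET members.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Hypothetical stand-in for a BOM-less UTF-8 constant such as the
    // proposed ENCODING_UTF8_NO_BOM; this is NOT the actual Lucene.NET field.
    public static readonly Encoding Utf8NoBom =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

    // Writes the text through a StreamWriter using the given encoding and
    // returns the raw bytes that ended up in the stream.
    public static byte[] Encode(Encoding encoding, string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            // MemoryStream.ToArray() is still valid after the writer
            // has closed the underlying stream.
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // .NET default UTF-8 instance: the preamble (EF BB BF) precedes the data.
        Console.WriteLine(BitConverter.ToString(Encode(Encoding.UTF8, "a")));

        // BOM-less instance: only the encoded character is written.
        Console.WriteLine(BitConverter.ToString(Encode(Utf8NoBom, "a")));
    }
}
```

Sharing one BOM-less instance (option 2) or referencing the existing one directly (option 1) would both avoid the three extra leading bytes shown in the first line of output.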