Re: [I] Review all Encoding usage for BOM compatibility [lucenenet]

via GitHub Thu, 26 Dec 2024 06:07:45 -0800


NightOwl888 commented on issue #1027:
URL: https://github.com/apache/lucenenet/issues/1027#issuecomment-2562788290

Looks like you missed `OfflineSorter`. The tests specifically failed when it
was configured to use a BOM, although I didn't analyze it at a high level to
find out why that was the case. No objections if you wish to investigate this,
but it definitely makes a difference as far as the tests are concerned.

It has gone through several rounds of refactoring since then, but currently
it has a
[`DEFAULT_ENCODING`](https://github.com/apache/lucenenet/blob/85c01412946ed1e2632cd2dfae4c672efd38caba/src/Lucene.Net/Util/OfflineSorter.cs#L44-L48)
field that we added to ensure the tests pass. So, we have a couple of options:

1. Remove the `DEFAULT_ENCODING` field and replace it with
`IOUtils.CHARSET_UTF_8`. Update the `OfflineSorter` documentation for
`ByteSequencesReader` and `ByteSequencesWriter` to indicate that the
constructor overloads that accept `BinaryReader` and `BinaryWriter` should use
`IOUtils.CHARSET_UTF_8`.
2. Initialize the `DEFAULT_ENCODING` field with the same instance as
`IOUtils.CHARSET_UTF_8`.

We added this field specifically because `OfflineSorter` requires that there
be no BOM (which differs from the .NET default), so this could go either way.
We only recently changed `IOUtils.CHARSET_UTF_8` to remove the BOM, so using
it wasn't an option when the `DEFAULT_ENCODING` field was added. If it had
been, we would have reused it and the field wouldn't have been added.
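
For anyone following along, here is a minimal sketch of the difference between the .NET default and a BOM-less UTF-8 encoding. It is not the Lucene.NET code: `BomDemo`, `Utf8NoBom`, and `WriteWith` are hypothetical names used only for illustration, and the sketch uses `StreamWriter` because that is where the preamble is easiest to observe.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Hypothetical stand-in for a shared BOM-less UTF-8 instance.
    // Passing false tells the encoding not to supply a UTF-8 preamble (BOM).
    public static readonly Encoding Utf8NoBom =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

    // Writes text through a StreamWriter and returns the raw bytes produced.
    public static byte[] WriteWith(Encoding encoding, string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding, 1024, leaveOpen: true))
            {
                writer.Write(text);
            }
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // Encoding.UTF8 reports the 3-byte preamble EF BB BF.
        Console.WriteLine(Encoding.UTF8.GetPreamble().Length);   // 3
        // The BOM-less instance reports an empty preamble.
        Console.WriteLine(Utf8NoBom.GetPreamble().Length);       // 0
        // StreamWriter emits the preamble at the start of the stream.
        Console.WriteLine(WriteWith(Encoding.UTF8, "a").Length); // 4
        Console.WriteLine(WriteWith(Utf8NoBom, "a").Length);     // 1
    }
}
```

A reader that does not expect those three extra leading bytes will treat them as data, which is one plausible way a BOM could break a byte-oriented file format. I have not confirmed that this is the exact failure mode in `OfflineSorter`.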

Side note: perhaps we should also rename `IOUtils.CHARSET_UTF_8` because it
is public and "CharSet" is Java nomenclature. `ENCODING_UTF8_NO_BOM` would be a
better name.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

