Re: [PR] Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 [lucenenet]

via GitHub Thu, 09 Jan 2025 20:03:28 -0800


paulirwin commented on code in PR #1089:
URL: https://github.com/apache/lucenenet/pull/1089#discussion_r1909785984



##########
src/Lucene.Net/Support/Text/EncodingExtensions.cs:
##########
@@ -0,0 +1,51 @@
+using System.Text;
+
+namespace Lucene.Net.Support.Text
+{
+    /*
+     * Licensed to the Apache Software Foundation (ASF) under one or more
+     * contributor license agreements.  See the NOTICE file distributed with
+     * this work for additional information regarding copyright ownership.
+     * The ASF licenses this file to You under the Apache License, Version 2.0
+     * (the "License"); you may not use this file except in compliance with
+     * the License.  You may obtain a copy of the License at
+     *
+     *     http://www.apache.org/licenses/LICENSE-2.0
+     *
+     * Unless required by applicable law or agreed to in writing, software
+     * distributed under the License is distributed on an "AS IS" BASIS,
+     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+     * See the License for the specific language governing permissions and
+     * limitations under the License.
+     */
+
+    /// <summary>
+    /// Extension methods for <see cref="Encoding"/>.
+    /// </summary>
+    internal static class EncodingExtensions
+    {
+        /// <summary>
+        /// Returns a new <see cref="Encoding"/> instance with the <see cref="DecoderFallback"/> set to throw
+        /// an exception when an invalid byte sequence is encountered.
+        /// <para />
+        /// This is equivalent to Java's <c>CodingErrorAction.REPORT</c> for both <c>onMalformedInput</c> and
+        /// <c>onUnmappableCharacter</c> and will throw a <see cref="DecoderFallbackException"/> when failing
+        /// to decode a string. This exception is equivalent to Java's <c>CharacterCodingException</c>, which is
+        /// a base exception type for both <c>MalformedInputException</c> and <c>UnmappableCharacterException</c>.
+        /// Thus, to translate Java code that catches any of those exceptions, you can catch
+        /// <see cref="DecoderFallbackException"/>.
+        /// </summary>
+        /// <param name="encoding">The encoding to clone and set the fallback on.</param>
+        /// <returns>A new <see cref="Encoding"/> instance with the fallback set to throw an exception.</returns>
+        /// <remarks>
+        /// Note that it is necessary to return a new, cloned <see cref="Encoding"/> instance because
+        /// the <see cref="Encoding.DecoderFallback"/> property is read-only without cloning.
+        /// </remarks>
+        public static Encoding WithDecoderExceptionFallback(this Encoding encoding)
+        {
+            Encoding newEncoding = (Encoding)encoding.Clone();

Review Comment:
   I reviewed the .NET source for GetHashCode, and it seems fine for this use 
to me. For the encoding we use most often, UTF8, it sums the hash codes of the 
encoder and decoder fallbacks (each either the hash code of the replacement 
string or, for an exception fallback, a hardcoded number), the UTF8 code page 
constant 65001, and a 1 or 0 depending on whether it should use the BOM. If by 
"optimized" you mean optimized for good hash distribution, that is possibly 
true: the comment in the code says "Not great distribution, but this is 
relatively unlikely to be used as the key in a hashtable." Hm... well, we are 
about to do just that.
   
   But honestly, most apps are only going to have one entry in this dictionary 
(UTF8), or at most a few, so hash distribution probably does not matter. I ran 
some benchmarks comparing three approaches: no caching, a ConcurrentDictionary, 
and caching only the common UTF8 case in a static field. Each was measured with 
just UTF8 and with a few common encodings. ConcurrentDictionary was faster than 
no caching in the just-UTF8 case and reduced allocations in both cases, even 
with perhaps suboptimal hash distribution. It was slightly slower when multiple 
encodings were stored, but the allocations were still reduced. The happy-path 
approach of caching only UTF8 was the fastest in the just-UTF8 case, where it 
had the same allocations as ConcurrentDictionary. If you're interested in that, 
we could take that approach. We'd have special cases for Encoding.UTF8 as well 
as the IOUtils version, but anything else would allocate a new encoding on 
Clone. I think most apps would only be using the UTF8 case.
   
   ```
   | Method                             | Mean     | Error    | StdDev   | Gen0   | Allocated |
   |----------------------------------- |---------:|---------:|---------:|-------:|----------:|
   | NoCaching_JustUTF8                 | 42.77 ns | 0.794 ns | 0.704 ns | 0.0255 |     160 B |
   | NoCaching_AFewEncodings            | 77.53 ns | 0.379 ns | 0.355 ns | 0.0331 |     208 B |
   | ConcurrentDictionary_JustUTF8      | 33.15 ns | 0.179 ns | 0.150 ns | 0.0102 |      64 B |
   | ConcurrentDictionary_AFewEncodings | 94.89 ns | 0.495 ns | 0.413 ns | 0.0235 |     148 B |
   | StaticUTF8_JustUTF8                | 23.92 ns | 0.353 ns | 0.313 ns | 0.0102 |      64 B |
   | StaticUTF8_AFewEncodings           | 76.18 ns | 1.214 ns | 1.076 ns | 0.0293 |     184 B |
   ```
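   
   For reference, here is a rough sketch of how the two caching options could 
be combined. It is not the code in this PR: the class name, method name, and 
field names are all made up for illustration, and the IOUtils special case is 
left out. It assumes only that Encoding.UTF8 always returns the same instance 
and that a cloned Encoding is writable.
   
   ```csharp
   using System.Collections.Concurrent;
   using System.Text;

   // Hypothetical sketch only; names and cache layout are assumptions, not the PR's code.
   internal static class CachedEncodingExtensions
   {
       // Happy path: a single pre-built throwing clone of Encoding.UTF8.
       private static readonly Encoding utf8Throwing = CreateThrowing(Encoding.UTF8);

       // General path: one throwing clone per distinct source encoding.
       // Keys are compared with Encoding.Equals/GetHashCode.
       private static readonly ConcurrentDictionary<Encoding, Encoding> cache =
           new ConcurrentDictionary<Encoding, Encoding>();

       public static Encoding WithDecoderExceptionFallbackCached(this Encoding encoding)
       {
           // Encoding.UTF8 is a shared instance, so a reference check is enough here.
           if (ReferenceEquals(encoding, Encoding.UTF8))
               return utf8Throwing;

           return cache.GetOrAdd(encoding, e => CreateThrowing(e));
       }

       private static Encoding CreateThrowing(Encoding encoding)
       {
           // Clone first: the shared instances are read-only.
           Encoding clone = (Encoding)encoding.Clone();
           clone.DecoderFallback = DecoderFallback.ExceptionFallback;
           return clone;
       }
   }
   ```
   
   With something like this, `Encoding.UTF8.WithDecoderExceptionFallbackCached()` 
would return the same instance on every call, and decoding an invalid sequence 
with it (for example the single byte 0xFF) would throw DecoderFallbackException. 
Dropping the dictionary and cloning in the non-UTF8 branch would give the 
static-field-only variant.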



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
