Re: [PR] BREAKING: Implement ISpanAppendable in CharBlockArray and ICharTermAttribute/CharTermAttributeImpl [lucenenet]

via GitHub Thu, 21 Nov 2024 09:40:08 -0800


NightOwl888 commented on code in PR #1028:
URL: https://github.com/apache/lucenenet/pull/1028#discussion_r1848113664



##########
src/Lucene.Net.Facet/Taxonomy/WriterCache/CharBlockArray.cs:
##########
@@ -310,29 +248,8 @@ public virtual CharBlockArray Append(string? value, int 
startIndex, int length)
             if (startIndex > value.Length - length)
                 throw new ArgumentOutOfRangeException(nameof(startIndex), 
$"Index and length must refer to a location within the string. For example 
{nameof(startIndex)} + {nameof(length)} <= {nameof(Length)}.");
 
-
-            int offset = startIndex;
-            int remain = length;
-            while (remain > 0)
-            {
-                if (this.current.length == this.blockSize)
-                {
-                    AddBlock();
-                }
-                int toCopy = remain;
-                int remainingInBlock = this.blockSize - this.current.length;
-                if (remainingInBlock < toCopy)
-                {
-                    toCopy = remainingInBlock;
-                }
-                value.CopyTo(offset, this.current.chars, this.current.length, 
toCopy);

Review Comment:
   See my note above about copying performance.



##########
src/Lucene.Net/Analysis/TokenAttributes/CharTermAttributeImpl.cs:
##########
@@ -216,10 +216,8 @@ public CharTermAttribute Append(string value, int 
startIndex, int charCount)
             if (startIndex > value.Length - charCount)
                 throw new ArgumentOutOfRangeException(nameof(startIndex), 
$"Index and length must refer to a location within the string. For example 
{nameof(startIndex)} + {nameof(charCount)} <= {nameof(Length)}.");
 
-            value.CopyTo(startIndex, InternalResizeBuffer(termLength + 
charCount), termLength, charCount);

Review Comment:
   See my note above about copying performance.



##########
src/Lucene.Net.Facet/Taxonomy/WriterCache/CharBlockArray.cs:
##########
@@ -235,29 +214,8 @@ public virtual CharBlockArray Append(char[]? value, int 
startIndex, int length)
             if (startIndex > value.Length - length)
                 throw new ArgumentOutOfRangeException(nameof(startIndex), 
$"Index and length must refer to a location within the string. For example 
{nameof(startIndex)} + {nameof(length)} <= {nameof(Length)}.");
 
-
-            int offset = startIndex;
-            int remain = length;
-            while (remain > 0)
-            {
-                if (this.current.length == this.blockSize)
-                {
-                    AddBlock();
-                }
-                int toCopy = remain;
-                int remainingInBlock = this.blockSize - this.current.length;
-                if (remainingInBlock < toCopy)
-                {
-                    toCopy = remainingInBlock;
-                }
-                Arrays.Copy(value, offset, this.current.chars, 
this.current.length, toCopy);

Review Comment:
   See my note above about copying performance.



##########
src/Lucene.Net/Analysis/TokenAttributes/CharTermAttributeImpl.cs:
##########
@@ -234,11 +232,8 @@ public CharTermAttribute Append(char[] value)
                 //return AppendNull();
                 return this; // No-op
 
-            int len = value.Length;
-            value.CopyTo(InternalResizeBuffer(termLength + len), termLength);

Review Comment:
   See my note above about copying performance.



##########
src/Lucene.Net.Facet/Taxonomy/WriterCache/CharBlockArray.cs:
##########
@@ -192,28 +191,8 @@ public virtual CharBlockArray Append(char[]? value)
                 return this; // No-op
             }
 
-            int remain = value.Length;
-            int offset = 0;
-            while (remain > 0)
-            {
-                if (this.current.length == this.blockSize)
-                {
-                    AddBlock();
-                }
-                int toCopy = remain;
-                int remainingInBlock = this.blockSize - this.current.length;
-                if (remainingInBlock < toCopy)
-                {
-                    toCopy = remainingInBlock;
-                }
-                Arrays.Copy(value, offset, this.current.chars, 
this.current.length, toCopy);

Review Comment:
   In #766, it was discovered that on .NET Framework using 
`ReadOnlySpan<T>.Copy()` is significantly slower than other copy methods in 
most cases. There was an `IArrayCopier<T>` added to select the best copy method 
depending on data type and other factors: 
https://github.com/apache/lucenenet/pull/766/files#diff-c0b8165f7f169cb39fade950ceb5b519f42e4869be060bb3114716e453bf8d37.
 This is exposed as `Arrays.Copy<T>()`. It would be best to add the 
`Append(ReadOnlySpan<char>)` overload without modifying these other overloads, 
unless you want to go through benchmarking the char data type all over again or 
we are switching from a less performant copy to a more performant copy, such as 
`Arrays.Copy<T>()`.



##########
src/Lucene.Net/Analysis/TokenAttributes/CharTermAttributeImpl.cs:
##########
@@ -260,10 +255,8 @@ public CharTermAttribute Append(char[] value, int 
startIndex, int charCount)
             if (startIndex > value.Length - charCount)
                 throw new ArgumentOutOfRangeException(nameof(startIndex), 
$"Index and length must refer to a location within the string. For example 
{nameof(startIndex)} + {nameof(charCount)} <= {nameof(Length)}.");
 
-            Arrays.Copy(value, startIndex, InternalResizeBuffer(termLength + 
charCount), termLength, charCount);

Review Comment:
   See my note above about copying performance.



##########
src/Lucene.Net.Tests/Analysis/TokenAttributes/TestCharTermAttributeImpl.cs:
##########
@@ -326,6 +330,10 @@ public virtual void 
TestAppendableInterfaceWithLongSequences()
             const string longTestString = "012345678901234567890123456789";
             t.Append(new CharSequenceAnonymousClass(longTestString));
             Assert.AreEqual("4567890123456" + longTestString, t.ToString());
+
+            // LUCENENET specific - test ReadOnlySpan<char> overload
+            
t.SetEmpty().Append("01234567890123456789012345678901234567890123456789".AsSpan());
+            
Assert.AreEqual("01234567890123456789012345678901234567890123456789", 
t.ToString());

Review Comment:
   Since `ReadOnlySpan<char>` is often a direct replacement for `CharBuffer` we 
should demonstrate how to slice `ReadOnlySpan<char>` the same way `CharBuffer` 
is sliced.
   
   ```c#
   t.Append(/*(ICharSequence)*/ 
CharBuffer.Wrap("01234567890123456789012345678901234567890123456789".ToCharArray()),
 3, 50 - 3); // LUCENENET: Corrected 3rd parameter
   
Assert.AreEqual("0123456789012345678901234567890123456789012345678934567890123456789012345678901234567890123456789",
 t.ToString());
   ```
   
   Should be translated as:
   
   ```c#
   t.Append("01234567890123456789012345678901234567890123456789".AsSpan(3, 50 - 
3)); // LUCENENET: Corrected 2nd parameter
   
Assert.AreEqual("0123456789012345678901234567890123456789012345678934567890123456789012345678901234567890123456789",
 t.ToString());
   ```
   
   It looks like we may need to adjust the expected value depending on where 
this is called, though. Maybe it would be best to break it out as a 
`TestSpanAppendableInterface()` test. Note that this test is more of a demo for 
users to show how to translate the Java code to .NET than a real test.



##########
src/Lucene.Net/Analysis/TokenAttributes/CharTermAttribute.cs:
##########
@@ -76,15 +77,15 @@ public interface ICharTermAttribute : IAttribute, 
ICharSequence, IAppendable
         /// the termBuffer array. Use this to truncate the termBuffer
         /// or to synchronize with external manipulation of the termBuffer.
         /// Note: to grow the size of the array,
-        /// use <see cref="ResizeBuffer(int)"/> first. 
-        /// NOTE: This is exactly the same operation as calling the <see 
cref="Length"/> setter, the primary 
+        /// use <see cref="ResizeBuffer(int)"/> first.
+        /// NOTE: This is exactly the same operation as calling the <see 
cref="Length"/> setter, the primary
         /// difference is that this method returns a reference to the current 
object so it can be chained.
         /// <code>
         /// obj.SetLength(30).Append("hey you");
         /// </code>
         /// </summary>
         /// <param name="length"> the truncated length </param>
-        ICharTermAttribute SetLength(int length); 
+        ICharTermAttribute SetLength(int length);

Review Comment:
   There doesn't seem to be a good reason to force implementations to define 
`SetLength()`, since it should just cascade the call to the setter of `Length`. 
We should remove this from the interface and make this method into an extension 
method on `ICharTermAttribute` (in a 
`Lucene.Net.Analysis.TokenAttributes.Extensions` namespace). It should also be 
made to return the type of `ICharTermAttribute` it is called upon rather than 
the interface type.
   
   This can be made into a separate issue.



##########
src/Lucene.Net/Analysis/TokenAttributes/CharTermAttribute.cs:
##########
@@ -26,7 +26,8 @@ namespace Lucene.Net.Analysis.TokenAttributes
     /// <summary>
     /// The term text of a <see cref="Token"/>.
     /// </summary>
-    public interface ICharTermAttribute : IAttribute, ICharSequence, 
IAppendable
+    public interface ICharTermAttribute : IAttribute, ICharSequence, 
IAppendable,
+        ISpanAppendable /* LUCENENET specific */

Review Comment:
   This appears to be the breaking change you were referring to. I am debating 
whether it makes sense to add this interface here, since it was specifically 
separated out so that it wouldn't need to be a breaking change. But if we are 
going to break this, now is the time.
   
   Keep in mind, there is a degraded `Append(ReadOnlySpan<char>)` extension 
method that is functional if this isn't implemented. The implementation is just 
an optimization. So, while there is a clear performance benefit to forcing the 
user to implement this, it isn't strictly necessary. If the user does implement 
`ISpanAppendable` on the concrete type, it will use that implementation instead 
of the degraded one.
   
   Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] BREAKING: Implement ISpanAppendable in CharBlockArray and ICharTermAttribute/CharTermAttributeImpl [lucenenet]

Reply via email to