NightOwl888 commented on issue #1151:
URL: https://github.com/apache/lucenenet/issues/1151#issuecomment-2751008411

   @paulirwin 
   
   Nice work tracking this down. I know how hard this can be!
   
   First of all, the below is all up for debate and is just my opinion.
   
   ## To ByteBuffer or Not To ByteBuffer?
   
   I should preface this with my desire to eliminate `ByteBuffer` from the 
codebase (especially in ICU4N). Ideally, we would move to modern .NET APIs 
instead, but in Lucene.NET, it probably means drifting pretty far from the 
original design because of how they have extended and/or wrapped `ByteBuffer` 
in several components and how unlike anything in .NET it is. Fortunately, the 
usages are localized to the `Lucene.Net.Store` namespace and some subclasses.
   
   `ByteBuffer` is almost the same thing as `BinaryReader` and `BinaryWriter` 
around a `MemoryStream`, except that it also natively supports big endian 
reads, which is missing from that set of tools. It also allows for quick access 
to other wrappers, such as `CharBuffer`, which wraps the same underlying memory 
to read arrays of `char` in addition to bytes. However, in practice, many of 
the buffers in ICU4N could be replaced with `Span<T>` or `Memory<T>` or the 
read-only equivalents depending on the operation.  These components cut much 
closer to the bone and are easier to optimize. `CharBuffer` in particular 
implements `ICharSequence`, which we are working to essentially replace with 
`ReadOnlySpan<char>` or `ReadOnlyMemory<char>` across the API, although the 
only place we used `CharBuffer` is in tests.
   
   The `ByteBuffer` API has some confusing members, such as `Flip()` (which 
doesn't flip any array members) and `Clear()` (which doesn't clear the buffer). 
These are managing both the position state and a "mark", which is presumably 
how it determines where to slice from. When porting, trying to grasp what is 
going on at a high level in addition to the low level in order to port it is a 
bit confusing. As such, it is hard to envision how to support moving away from 
this API when upgrading Lucene. It could be done, but we would need a roadmap 
to follow. Of course, using LLMs to get answers on this is a lot easier than 
analyzing source and reviewing docs to try to understand how best to port it.
   
   In addition, J2N is a general purpose library. ICU4N currently depends on 
`MemoryMappedViewByteBuffer` and it is potentially broken for that use case, as 
well. Whether we use it in the future or not, this is an existing J2N public 
API that is currently not performing well for all consumers, and will need to 
be addressed.
   
   Combined with the fact that we would all like a production release of 
Lucene.NET as soon as possible, fixing the `MemoryMappedViewByteBuffer` seems 
to be the most expedient option rather than fixing it in 2 or 3 different 
places and perhaps in 2 or 3 different ways.
   
   ## One Byte at a Time
   
   The purpose of reading one byte at a time in Java appears to be to eliminate 
allocating temporary buffers on the heap to convert primitive types such as 
`int`, `long`, and `char` to and from bytes. However, in .NET we have the 
option of putting these tiny buffers on the stack so we don't need to allocate 
any heap to do the conversions.
   
   ```c#
   Span<byte> buffer = stackalloc byte[4];
   ```
   
   But it does require there to be APIs to pass that span to without creating 
another allocation on the heap in order for it to be sensible, and `ByteBuffer` 
is currently missing them.
   
   I haven't run any benchmarks, but in `System.Buffers`, there is a 
`BinaryPrimitives` class that can be used to convert `Span<byte>` or 
`ReadOnlySpan<byte>` to and from primitive types. Reviewing the code, it is 
using `MemoryMarshal` along with some bit twiddling. Since there are no loops, 
I suspect this will generally be faster than how it was done in Java. Given 
that members that are reading or writing bytes are very unlikely to change in 
Lucene (other than the vint and vlong types), we can probably utilize these 
rather than the direct port from Java. Maybe there is a way to borrow some of 
the techniques from `BinaryPrimitives` to optimize vint, vlong, and LZ4, as 
well.
   
   
https://github.com/dotnet/runtime/tree/v9.0.3/src/libraries/System.Private.CoreLib/src/System/Buffers/Binary
   
   Note however, that only the integral number conversions are available prior 
to .NET Core in the BCL. So, we will need to patch the others. If we follow the 
convention used for `System.Numeric.BitOperations`, we should create a public 
static class named `J2N.Buffers.Binary.BinaryPrimitive` (singular, not plural) 
and put in all of the support there. We only need the primitive types that 
J2N/Java supports (no unsigned types), but it wouldn't hurt to include all of 
them and is probably easier to do because we can just copy, paste, and rename 
from .NET 9 for the most part. I am not sure whether it makes any sense to 
cascade the calls to the BCL when they are supported, since there doesn't seem 
to be much of a point to optimize this code further. It is unlikely to ever 
change.
   
   > We will need to reuse these in Lucene.NET `DataInput` and `DataOutput`, so 
it makes sense to make them shared and not sensitive to which target framework 
is being used.
   
   ## System.Memory Support
   
   `J2N.IO.Buffer` and `J2N.IO.ByteBuffer` would need to support 
`Get(Span<byte>)` and `Put(ReadOnlySpan<byte>)`, which can be used to load the 
buffer on the stack for the primitive type conversions. Much like with 
[`FileStream.Read(Span<byte> 
destination)`](https://github.com/dotnet/runtime/blob/v9.0.3/src/libraries/System.Private.CoreLib/src/System/IO/Strategies/BufferedFileStreamStrategy.cs#L153-L248),
 these should not support any `offset` or `length` overloads because spans 
allow you to slice the memory before passing them into the methods.
   
   Buffering the `MemoryMappedViewAccessor` means we can support these methods 
even though `MemoryMappedViewAccessor` has no span support.
   
   For the `ByteBuffer` class, we can create default implementations of `Get()` 
and `Put()` that check the `HasArray` property. 
   
   1. If it supports an array, we can use the `ProtectedArray` proprty to get a 
span of the appropriate length.
   2. If it doesn't support an array, it can throw `NotSupportedException`.
   3. Subclasses can override these virtual members to provide more 
optimizations and features. In this case, we will need to track whether 
reading/writing goes off the end of the buffer and load more data to fill in or 
read the remainder of the input.
   
   ## MemoryMappedViewByteBuffer Buffering Design
   
   1. We should not use the `ProtectedArray` property to expose the buffer 
externally because it assumes the entire buffer is available in the 
`Array`/`ProtectedArray` properties.
   2. `ProtectedHasBuffer` should return `false`.
   3. We should review and override members that can be optimized or should be 
done to account for buffering. Some missing overloads that could be optimized 
better include:
     a. `Equals(ByteBuffer)`
     b. `CompareTo(ByteBuffer)`
   4. The buffering can be modeled after `FileStream`, since it also supports 
both reading and writing to an underlying file while supporting a buffer.
   5. Much like `FileStream`, we can overload `Flush()` with a boolean 
parameter for `flushToDisk`, the default being `false`. Disposing the 
`MemoryMappedViewByteBuffer` should call `Flush(true)`.
   6. Ideally, the size of the buffer should be configurable in the constructor 
overloads, much like `FileStream`.
   7. There weren't a lot of tests for these classes in Apache Harmony or the 
JDK. I ported what I could, but there isn't a lot of coverage. If we can add 
more, that would be great.
   
   There are probably more things to consider and this plan is admittedly a bit 
half-baked, but this is a good start.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to