kbendick commented on a change in pull request #1334:
URL: https://github.com/apache/iceberg/pull/1334#discussion_r471895373



##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -231,7 +231,33 @@ public void nonNullWrite(int rowId, String data, ColumnVector output) {
 
     @Override
     public void nonNullWrite(int rowId, ByteBuffer data, ColumnVector output) {
-      ((BytesColumnVector) output).setRef(rowId, data.array(), 0, data.array().length);
+      // Don't assume that we're reading the entire backing array.
+      //
+      // We use the same logic for on-heap and off-heap buffers, as we don't
+      // assume that the position of the byte buffer received for the write is
+      // at the beginning. That assumption would be needed to make use of
+      // methods like `.slice()`, or of anything else that would help here if
+      // we checked that there is an on-heap backing array and that the buffer
+      // is at the beginning position. (In other words, we skip the
+      // `data.hasArray()` check because it would not enable any optimization
+      // in this case.)
+      int position = data.position();
+      int limit = data.limit();
+
+      // Prep for the copy into bytes.
+      byte[] bytes = new byte[data.remaining()];
+
+      // Copy the remainder of the byte buffer into bytes. Note that the
+      // second argument of get() is an offset into the destination array,
+      // not into the buffer, so it must be 0 here.
+      data.get(bytes, 0, bytes.length);
+
+      // Reset the byte buffer to its original state. Restore the limit first,
+      // as setting a position greater than the limit is invalid.
+      data.limit(limit);
+      data.position(position);

Review comment:
       I took the copying logic from `org.apache.iceberg.io.ParquetValueWriters`, following it into `org.apache.parquet.io.api.Binary`: specifically, the need to save and restore both the limit and the position in order to return the original ByteBuffer, `data`, to its original state, as `ByteBuffer` offers no single call to do that.
   
   Since I was no longer going to apply any on-heap optimizations (such as using `.slice()`, `.duplicate()`, or `.array()`), I chose to skip the check for `data.hasArray()` entirely.
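
   For contrast, here is a hypothetical sketch of the kind of `hasArray()` fast path being skipped; this is illustration only, not part of the change, and `tryZeroCopyWrite` is a made-up helper name:

```java
import java.nio.ByteBuffer;
// Import path assumed; Iceberg's ORC code may use a shaded variant of this class.
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;

// Hypothetical zero-copy path: only valid for non-direct buffers with an
// accessible backing array, i.e. when data.hasArray() returns true.
static boolean tryZeroCopyWrite(int rowId, ByteBuffer data, BytesColumnVector output) {
  if (!data.hasArray()) {
    return false; // direct (off-heap) buffer: the caller must fall back to copying
  }
  // Point the column vector at the backing array without copying; the start
  // offset accounts for both the array offset and the current position.
  output.setRef(rowId, data.array(), data.arrayOffset() + data.position(), data.remaining());
  return true;
}
```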
   
   I emulated the logic for copying and then resetting the original byte buffer `data`'s position and limit (the position could otherwise be invalidated by moving it past the limit), as demonstrated in `org.apache.parquet.io.api.Binary`.
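
   For reference, a minimal self-contained sketch of that copy-and-restore pattern, using only the standard `java.nio.ByteBuffer` API (the demo class and method names are hypothetical, for illustration):

```java
import java.nio.ByteBuffer;

public class ByteBufferCopyDemo {

  /** Copies the remaining bytes of {@code data} without disturbing its state. */
  static byte[] copyRemaining(ByteBuffer data) {
    int position = data.position();
    int limit = data.limit();

    byte[] bytes = new byte[data.remaining()];
    // Relative bulk get: reads from the current position and advances it.
    data.get(bytes, 0, bytes.length);

    // Restore the limit first, then the position, since ByteBuffer rejects
    // a position greater than the current limit.
    data.limit(limit);
    data.position(position);
    return bytes;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5});
    buf.position(2); // simulate a buffer whose position is not at the beginning
    byte[] copied = copyRemaining(buf);
    System.out.println(copied.length);  // 3
    System.out.println(buf.position()); // 2 -- original state restored
  }
}
```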



