kbendick commented on a change in pull request #1334:
URL: https://github.com/apache/iceberg/pull/1334#discussion_r470922881
##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -231,7 +231,24 @@ public void nonNullWrite(int rowId, String data, ColumnVector output) {
 @Override
 public void nonNullWrite(int rowId, ByteBuffer data, ColumnVector output) {
-  ((BytesColumnVector) output).setRef(rowId, data.array(), 0, data.array().length);
+  // We technically can't be sure whether the ByteBuffer coming in is on or off
+  // heap, so we cannot safely call `.array()` on it without first checking
+  // via ByteBuffer.hasArray().
+  // See: https://errorprone.info/bugpattern/ByteBufferBackingArray
+  //
+  // When there is a backing heap-based byte array, we avoid the overhead of
+  // copying, which is especially important for small byte buffers.
+  //
+  // TODO - This copy slows it down, perhaps unnecessarily. Is there any other way to tell, or no?
Review comment:
I went looking into Spark's implementation of off-heap byte arrays (for the Spark-native version), and it somewhat led me to believe that it could be reading in values as off-heap byte buffers when `spark.memory.offHeap.enabled=true` is set along with some value for `spark.memory.offHeap.size`.
I'm working on a demo to see if that's true, but of note: when I started up a `spark-shell` with those parameters specified (`4G` for the off-heap size), I got the following warning, which leads me to believe that ByteBuffers were being created off heap.
```
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
(file:/usr/local/Cellar/apache-spark/3.0.0/libexec/jars/spark-unsafe_2.12-3.0.0.jar)
to constructor java.nio.DirectByteBuffer(long,int)
```
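(For reproduction, the shell was started along these lines; the two `--conf` flags are the configs named above, and the binary path will vary by install:)
```
spark-shell \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=4g
```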
I am still trying to put together a demo that writes specifically off-heap byte buffers to `format("orc")` and then reads them back to see what happens, but I'm not 100% sure when I'll have the time.
But if the ByteBuffers are off heap and we pass them to this function without "sanitizing" them, we need the check. The check is pretty inexpensive as well, fwiw: it just checks whether the buffer's backing array reference is non-null (and that the buffer isn't read-only).
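To make that concrete, here is a minimal standalone sketch (not from the PR) of what `hasArray()` reports for each kind of buffer:
```
import java.nio.ByteBuffer;

public class BufferKindDemo {
  public static void main(String[] args) {
    ByteBuffer onHeap = ByteBuffer.allocate(16);        // heap-backed
    ByteBuffer offHeap = ByteBuffer.allocateDirect(16); // direct / off-heap

    System.out.println(onHeap.hasArray());  // true  -> array() is safe to call
    System.out.println(offHeap.hasArray()); // false -> array() would throw
                                            //          UnsupportedOperationException
    System.out.println(offHeap.isDirect()); // true
  }
}
```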
##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -231,7 +231,24 @@ public void nonNullWrite(int rowId, String data, ColumnVector output) {
 @Override
 public void nonNullWrite(int rowId, ByteBuffer data, ColumnVector output) {
-  ((BytesColumnVector) output).setRef(rowId, data.array(), 0, data.array().length);
+  // We technically can't be sure whether the ByteBuffer coming in is on or off
+  // heap, so we cannot safely call `.array()` on it without first checking
+  // via ByteBuffer.hasArray().
+  // See: https://errorprone.info/bugpattern/ByteBufferBackingArray
+  //
+  // When there is a backing heap-based byte array, we avoid the overhead of
+  // copying, which is especially important for small byte buffers.
+  //
+  // TODO - This copy slows it down, perhaps unnecessarily. Is there any other way to tell, or no?
+  // My guess is no, if I consider things like VectorizedOrcReaders on Spark.
+  if (data.hasArray()) {
+    ((BytesColumnVector) output).setRef(rowId, data.array(), 0, data.array().length);
Review comment:
I'd like to point out that this is the original version of `nonNullWrite`, unless you're referring to the `else` block (lines 246-251 just below your comment), in which case that's my addition.
https://github.com/apache/iceberg/pull/1334/files#diff-b1b07b15f036000a3f2bed76fdd9f961R246-R251
I believe my addition (in the `else` block) is correct, as it does call `remaining`. In the `hasArray` case, I just left the original function as it was. I'd be happy to update anything that needs updating, though.
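For anyone skimming the thread, the overall shape under discussion looks roughly like the sketch below. The `hasArray()` branch is the original call kept verbatim; the `else` body is my paraphrase of the addition (the diff link above has the exact lines), using the standard copy-out idiom for direct buffers:
```
import java.nio.ByteBuffer;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;

class ByteBufferWriteSketch {
  void nonNullWrite(int rowId, ByteBuffer data, ColumnVector output) {
    if (data.hasArray()) {
      // Heap-backed buffer: point the vector at the backing array, no copy.
      ((BytesColumnVector) output).setRef(rowId, data.array(), 0, data.array().length);
    } else {
      // Direct (off-heap) buffer: copy the remaining bytes onto the heap first.
      byte[] copy = new byte[data.remaining()];
      data.duplicate().get(copy); // duplicate() leaves the caller's position untouched
      // setVal copies into the vector's own buffer, unlike setRef.
      ((BytesColumnVector) output).setVal(rowId, copy, 0, copy.length);
    }
  }
}
```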
##########
File path: pig/src/main/java/org/apache/iceberg/pig/IcebergPigInputFormat.java
##########
@@ -243,6 +243,7 @@ private boolean advance() throws IOException {
return true;
}
+ @SuppressWarnings("ByteBufferBackingArray")
Review comment:
Ok. Thanks for the insight. This was admittedly one suppression I wasn't sure about. I'm in agreement though, for the same reasons I've given elsewhere: I know much less about Pig, but I think it's better to be safe than sorry here if we don't know the exact behavior of the input source, now or in the future.
Should I update it in this PR or in another PR?
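If we ever want to remove the suppression instead, a defensive conversion along these lines would satisfy the checker for both heap and direct buffers (a sketch with a hypothetical helper name, not code from this PR):
```
import java.nio.ByteBuffer;

final class ByteBufferUtil { // hypothetical helper, named for illustration only
  private ByteBufferUtil() {
  }

  // Returns the buffer's remaining bytes as an on-heap array, copying only
  // when the checker's concern applies: direct or read-only buffers, or a
  // backing array that doesn't line up exactly with the remaining bytes.
  static byte[] toByteArray(ByteBuffer buffer) {
    if (buffer.hasArray()
        && buffer.arrayOffset() == 0
        && buffer.position() == 0
        && buffer.array().length == buffer.remaining()) {
      return buffer.array(); // safe: heap-backed and exactly the bytes we want
    }
    byte[] copy = new byte[buffer.remaining()];
    buffer.duplicate().get(copy); // duplicate() preserves the caller's position
    return copy;
  }
}
```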