amogh-jahagirdar commented on code in PR #7591:
URL: https://github.com/apache/iceberg/pull/7591#discussion_r1192679845
##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java:
##########
@@ -57,6 +58,10 @@ public class TestParquetVectorizedReads extends AvroDataTest
{
static final Function<GenericData.Record, GenericData.Record> IDENTITY =
record -> record;
+ static {
+ System.setProperty("arrow.enable_null_check_for_get", "true");
+ }
Review Comment:
TBH this is probably not the right way to handle the failing tests. For
context,
[here](https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java#L99)
we always use the `NullCheckingForGet.NULL_CHECKING_ENABLED` value, which is a
static final variable that is set once based on the value of the
`arrow.enable_null_check_for_get` property.
Right now most of our tests want to validate the validity buffer, and a few
do not. It's not possible to set the property dynamically for these different
cases: because the field is static final, once it's set, every read of the
value just yields the original value.
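
To illustrate why flipping the property between tests has no effect, here is a
minimal sketch. It does not use Arrow itself: `NullCheckFlag` and the
`demo.enable_null_check` property are hypothetical stand-ins for
`NullCheckingForGet` and `arrow.enable_null_check_for_get`, assuming the real
class follows the same "read the property once in a static initializer"
pattern described above.

```java
public class StaticFlagDemo {
  // Hypothetical stand-in for a class that caches a system property
  // in a static final field during class initialization.
  static final class NullCheckFlag {
    static final boolean ENABLED =
        Boolean.parseBoolean(System.getProperty("demo.enable_null_check", "false"));
  }

  // Returns true only if both reads observe the value captured at class init.
  static boolean readFlagAfterToggle() {
    System.setProperty("demo.enable_null_check", "true");
    boolean first = NullCheckFlag.ENABLED; // first access initializes the class

    System.setProperty("demo.enable_null_check", "false");
    boolean second = NullCheckFlag.ENABLED; // property changed, field did not

    return first && second;
  }

  public static void main(String[] args) {
    System.out.println(readFlagAfterToggle());
  }
}
```

Whichever test class touches the flag first decides the value for every other
test running in the same JVM.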
Previously we had an API for explicitly telling the vectorized reader whether
it should use the validity buffer, but now we want to deprecate that.
In practice users will set this once for their Spark job, but for testing we
want to validate both paths. My implementation here just optimizes for the
majority of the existing test cases, but misses out on validating the behavior
when this is set to false, which is the default due to better performance.
Long story short, I'm thinking we should still expose a method for setting
the validity buffer, but it will be package-private. This package-private
method would be used only for testing, to construct a Parquet reader depending
on what we want to test.
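
A rough sketch of the shape I have in mind, not the actual Iceberg code:
`ValidityToggleSketch`, `SketchReader`, `buildReader`, and the
`demo.enable_null_check` property are all made-up names for illustration. The
public entry point keeps using the static default, while the package-private
overload lets a test in the same package pick either path.

```java
public class ValidityToggleSketch {
  // Hypothetical default, captured once (mirrors the static final flag).
  static final boolean NULL_CHECK_DEFAULT =
      Boolean.parseBoolean(System.getProperty("demo.enable_null_check", "false"));

  // Toy reader: consults the validity array only when asked to.
  static final class SketchReader {
    private final boolean useValidityBuffer;

    SketchReader(boolean useValidityBuffer) {
      this.useValidityBuffer = useValidityBuffer;
    }

    // Returns null for an invalid slot only if validity checking is on;
    // otherwise returns whatever raw value is stored.
    Integer read(int[] values, boolean[] valid, int row) {
      if (useValidityBuffer && !valid[row]) {
        return null;
      }
      return values[row];
    }
  }

  // Public API: users get the behavior selected by the static default.
  public static SketchReader buildReader() {
    return buildReader(NULL_CHECK_DEFAULT);
  }

  // Package-private: only code in the same package (i.e. tests) can override.
  static SketchReader buildReader(boolean useValidityBuffer) {
    return new SketchReader(useValidityBuffer);
  }
}
```

Tests could then call `buildReader(true)` or `buildReader(false)` directly
without depending on JVM-wide state.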
Thoughts @aokolnychyi @singhpk234 ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]