[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7591: Spark: Remove deprecated VectorizedSparkParquetReaders#buildReader API for 1.3.0 release

via GitHub Fri, 12 May 2023 11:29:12 -0700


amogh-jahagirdar commented on code in PR #7591:
URL: https://github.com/apache/iceberg/pull/7591#discussion_r1192679845



##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java:
##########
@@ -57,6 +58,10 @@ public class TestParquetVectorizedReads extends AvroDataTest {
 
   static final Function<GenericData.Record, GenericData.Record> IDENTITY = record -> record;
 
+  static {
+    System.setProperty("arrow.enable_null_check_for_get", "true");
+  }

Review Comment:
   This is probably not the right way to handle the failing tests. For context, 
[here](https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java#L99)
 we always use the `NullCheckingForGet.NULL_CHECKING_ENABLED` value, which is a 
static final variable that is set once based on the value of the 
`arrow.enable_null_check_for_get` property.
   
   Right now most of our tests want to validate the validity buffer, and a few 
do not. It's not possible to set the property dynamically for these different 
cases: because the field is static final, once it's set every read of the 
value just yields the original value.
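
   To illustrate the static final behavior, here is a minimal sketch. The 
`NullCheckFlag` holder and the `demo.enable_null_check_for_get` property are 
hypothetical stand-ins, not Arrow's actual class or parsing logic; the point is 
only that a static final field read from a system property is fixed at class 
initialization.

```java
public class StaticFlagDemo {

  // Hypothetical stand-in for a class like NullCheckingForGet: the property
  // is read exactly once, when this holder class is first initialized.
  static final class NullCheckFlag {
    static final boolean ENABLED =
        Boolean.parseBoolean(System.getProperty("demo.enable_null_check_for_get", "false"));
  }

  // Sets the property, reads the flag, flips the property, reads again.
  static boolean[] readTwice() {
    System.setProperty("demo.enable_null_check_for_get", "true");
    boolean first = NullCheckFlag.ENABLED; // first access triggers class initialization

    System.setProperty("demo.enable_null_check_for_get", "false");
    boolean second = NullCheckFlag.ENABLED; // still the value captured at initialization

    return new boolean[] {first, second};
  }

  public static void main(String[] args) {
    boolean[] reads = readTwice();
    System.out.println("first=" + reads[0] + ", second=" + reads[1]);
  }
}
```

   Both reads return the same value, so a later test in the same JVM cannot 
switch to the other path by changing the property.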
   
   Before, we had an API for explicitly passing in to the vectorized reader 
whether it should use the validity buffer, but now we want to deprecate that.
   
   In practice users will set this once for their Spark job, but for the 
purpose of testing we want to validate both paths. My implementation here just 
optimizes for the majority of the existing test cases, but misses out on 
validating the behavior when this is set to false, which is the default due to 
better performance.
   
   Long story short, I'm thinking we should still expose a method for setting 
the validity buffer, but make it package-private. This package-private method 
would be used only for testing, to construct a Parquet reader depending on 
what we want to test.
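
   Roughly, the shape could look like the sketch below. All names here 
(`ReaderFactorySketch`, `Reader`, the `demo.` property) are hypothetical and 
only illustrate the idea; this is not the actual `VectorizedSparkParquetReaders` 
code.

```java
public class ReaderFactorySketch {

  // Stand-in for the process-wide flag that is fixed at class initialization.
  static final boolean NULL_CHECKING_ENABLED =
      Boolean.parseBoolean(System.getProperty("demo.enable_null_check_for_get", "false"));

  // Minimal placeholder for a vectorized reader; it only records the choice.
  static final class Reader {
    final boolean setValidityVector;

    Reader(boolean setValidityVector) {
      this.setValidityVector = setValidityVector;
    }
  }

  // Public entry point: always follows the process-wide flag.
  public static Reader buildReader() {
    return buildReader(NULL_CHECKING_ENABLED);
  }

  // Package-private overload: tests in the same package can pick either path
  // explicitly, regardless of how the static flag was initialized.
  static Reader buildReader(boolean setValidityVector) {
    return new Reader(setValidityVector);
  }

  public static void main(String[] args) {
    System.out.println(buildReader(true).setValidityVector);
    System.out.println(buildReader(false).setValidityVector);
  }
}
```

   Tests that care about the validity buffer would call the overload with 
`true`, and the rest with `false`, so both paths stay covered without touching 
the system property.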
   
   Thoughts @aokolnychyi @singhpk234 ?




