dramaticlly commented on code in PR #8560:
URL: https://github.com/apache/iceberg/pull/8560#discussion_r1326553809
##########
spark/v3.4/spark/src/main/scala/org/apache/spark/sql/execution/datasources/SparkExpressionConverter.scala:
##########
@@ -28,14 +28,15 @@ import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.Filter
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy
object SparkExpressionConverter {
   def convertToIcebergExpression(sparkExpression: Expression): org.apache.iceberg.expressions.Expression = {
     // Currently, it is a double conversion as we are converting Spark expression to Spark filter
     // and then converting Spark filter to Iceberg expression.
     // But these two conversions already exist and are well tested. So, we are going with this approach.
-    SparkFilters.convert(DataSourceStrategy.translateFilter(sparkExpression, supportNestedPredicatePushdown = true).get)
+    SparkV2Filters.convert(DataSourceV2Strategy.translateFilterV2(sparkExpression).get)
Review Comment:
Thank you @ConeyLiu, I will rebase my change if #8394 gets merged first. I also added comments where we now convert the Spark catalyst expression to a Predicate instead of a Spark source filter, and I ran all the unit tests to make sure the old filters work as expected.
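For context, here is a minimal sketch of the new conversion path this diff introduces (catalyst expression -> V2 `Predicate` -> Iceberg expression), written defensively around the `Option`/null results; the helper name and error messages are illustrative, not part of the PR:

```scala
import org.apache.iceberg.spark.SparkV2Filters
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy

// Sketch: translateFilterV2 returns an Option, and SparkV2Filters.convert
// may return null for predicates it cannot handle, so both steps are
// checked here instead of calling a bare .get as the one-liner does.
def convertToIcebergExpressionSafely(
    sparkExpression: Expression): org.apache.iceberg.expressions.Expression = {
  val v2Predicate = DataSourceV2Strategy
    .translateFilterV2(sparkExpression)
    .getOrElse(throw new IllegalArgumentException(
      s"Cannot translate Spark expression to a V2 predicate: $sparkExpression"))
  Option(SparkV2Filters.convert(v2Predicate))
    .getOrElse(throw new IllegalArgumentException(
      s"Cannot convert V2 predicate to an Iceberg expression: $v2Predicate"))
}
```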
##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -260,7 +285,7 @@ public void testBinPackAfterPartitionChange() {
}
@Test
-  public void testBinPackWithDeletes() throws Exception {
+  public void testBinPackWithDeletes() {
Review Comment:
Yes, IntelliJ suggested this during code analysis, and I don't think we actually throw here.
##########
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java:
##########
@@ -480,7 +511,10 @@ public void testRewriteDataFilesWithAllPossibleFilters() {
     sql(
         "CALL %s.system.rewrite_data_files(table => '%s'," + " where => 'c2 like \"%s\"')",
         catalogName, tableIdent, "car%");
-
+ // StringStartsWith
Review Comment:
Yeah, good catch. Initially I was about to add a filter here for the bucket transform (and forgot to change it), but I ended up creating a new method to test that all V2 filters can be evaluated without exceptions.
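Roughly, that new method looks like the sketch below: it loops a set of where-clauses through the procedure and checks that none of them throws. The clauses, catalog, and table names here are placeholders, not the actual test code:

```scala
// Hypothetical sketch: exercise one where-clause per V2 filter shape
// through the rewrite_data_files procedure; any unsupported pushdown
// would surface as an exception from spark.sql.
val whereClauses = Seq(
  "c2 like \"car%\"",        // StringStartsWith
  "c1 = 1",                  // EqualTo
  "c1 in (1, 2, 3)",         // In
  "c2 is not null",          // IsNotNull
  "system.bucket(2, c3) = 0" // bucket transform pushdown
)
whereClauses.foreach { where =>
  spark.sql(
    s"CALL $catalogName.system.rewrite_data_files(" +
      s"table => '$tableIdent', where => '$where')")
}
```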
##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -223,6 +223,31 @@ public void testBinPackWithFilter() {
assertEquals("Rows must match", expectedRecords, actualRecords);
}
+ @Test
+ public void testBinPackWithFilterOnBucketExpression() {
+    PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).bucket("c3", 2).build();
Review Comment:
Thank you Coney, your test utils class is super helpful. However, I realized that these SparkAction tests assume a Hadoop catalog, so the table creation is a bit different since it's done by table location (see https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java#L1563). I opted to use your `SystemFunctionPushDownHelper` in TestRewriteDataFilesProcedure instead.
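For reference, a minimal sketch of the location-based creation those action tests rely on, assuming the Hadoop tables API and the c1/c2/c3 schema used in this suite; the location path is a placeholder:

```scala
import java.util.Collections

import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.{PartitionSpec, Schema}
import org.apache.iceberg.hadoop.HadoopTables
import org.apache.iceberg.types.Types

// Hadoop tables are keyed by path rather than by a catalog name,
// so the table is created directly at a file location.
val schema = new Schema(
  Types.NestedField.optional(1, "c1", Types.IntegerType.get()),
  Types.NestedField.optional(2, "c2", Types.StringType.get()),
  Types.NestedField.optional(3, "c3", Types.StringType.get()))
val spec = PartitionSpec.builderFor(schema).bucket("c3", 2).build()
val tableLocation = "file:/tmp/iceberg-rewrite-test" // placeholder path
val table = new HadoopTables(new Configuration())
  .create(schema, spec, Collections.emptyMap[String, String](), tableLocation)
```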
##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java:
##########
@@ -1606,6 +1631,20 @@ private Table createTypeTestTable() {
return table;
}
+ private void insertData(int recordCount, int numDataFiles) {
Review Comment:
Dropped this function since we don't need it.