vinothchandar commented on a change in pull request #4996:
URL: https://github.com/apache/hudi/pull/4996#discussion_r831247780



##########
File path: 
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystExpressionUtils.scala
##########
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql

Review comment:
       Is this code adapted from somewhere? If so, can you please add source 
attribution?

##########
File path: 
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala
##########
@@ -59,147 +62,205 @@ object DataSkippingUtils extends Logging {
   }
 
   private def tryComposeIndexFilterExpr(sourceExpr: Expression, indexSchema: 
StructType): Option[Expression] = {
-    def minValue(colName: String) = col(getMinColumnNameFor(colName)).expr
-    def maxValue(colName: String) = col(getMaxColumnNameFor(colName)).expr
-    def numNulls(colName: String) = col(getNumNullsColumnNameFor(colName)).expr
-
-    def colContainsValuesEqualToLiteral(colName: String, value: Literal): 
Expression =
-    // Only case when column C contains value V is when min(C) <= V <= max(c)
-      And(LessThanOrEqual(minValue(colName), value), 
GreaterThanOrEqual(maxValue(colName), value))
-
-    def colContainsOnlyValuesEqualToLiteral(colName: String, value: Literal) =
-    // Only case when column C contains _only_ value V is when min(C) = V AND 
max(c) = V
-      And(EqualTo(minValue(colName), value), EqualTo(maxValue(colName), value))
-
+    //
+    // For translation of the Filter Expression for the Data Table into Filter 
Expression for Column Stats Index, we're
+    // assuming that
+    //    - The column A is queried in the Data Table (hereafter referred to 
as "colA")
+    //    - Filter Expression is a relational expression (ie "=", "<", "<=", 
...) of the following form
+    //
+    //      ```transform_expr(colA) = value_expr```
+    //
+    //      Where
+    //        - "transform_expr" is an expression of the _transformation_ 
which preserve ordering of the "colA"
+    //        - "value_expr" is an "value"-expression (ie one NOT referring to 
other attributes/columns or containing sub-queries)
+    //
+    // We translate original Filter Expr into the one querying Column Stats 
Index like following: let's consider
+    // equality Filter Expr referred to above:
+    //
+    //   ```transform_expr(colA) = value_expr```
+    //
+    // This expression will be translated into following Filter Expression for 
the Column Stats Index:
+    //
+    //   ```(transform_expr(colA_minValue) <= value_expr) AND (value_expr <= 
transform_expr(colA_maxValue))```

Review comment:
       Note to self: let's take an example that parses a timestamp `ts` column 
into a date using something like `date_format`.
   
   ` date_format(ts, ...) = '2022-03-01'` 
   
   
   We will simply look for files that have overlap with that date. sgtm

##########
File path: 
hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/HoodieSpark2CatalystExpressionUtils.scala
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.HoodieSparkTypeUtils.isCastPreservingOrdering
+import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, 
BitwiseOr, Cast, DateAdd, DateDiff, DateFormatClass, DateSub, Divide, Exp, 
Expm1, Expression, FromUTCTimestamp, FromUnixTime, Log, Log10, Log1p, Log2, 
Lower, Multiply, ParseToDate, ParseToTimestamp, ShiftLeft, ShiftRight, 
ToUTCTimestamp, ToUnixTimestamp, Upper}
+
+object HoodieSpark2CatalystExpressionUtils extends 
HoodieCatalystExpressionUtils {
+
+  override def tryMatchAttributeOrderingPreservingTransformation(expr: 
Expression): Option[AttributeReference] = {
+    expr match {
+      case OrderPreservingTransformation(attrRef) => Some(attrRef)
+      case _ => None
+    }
+  }
+
+  private object OrderPreservingTransformation {
+    def unapply(expr: Expression): Option[AttributeReference] = {
+      expr match {

Review comment:
       I guess these are the transformations that we whitelist.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to