cloud-fan commented on code in PR #52046:
URL: https://github.com/apache/spark/pull/52046#discussion_r2303024565
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala:
##########
```diff
@@ -95,24 +95,82 @@ package object expressions {
     StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
   }

-  // It's possible that `attrs` is a linked list, which can lead to bad O(n) loops when
-  // accessing attributes by their ordinals. To avoid this performance penalty, convert the input
-  // to an array.
-  @transient private lazy val attrsArray = attrs.toArray
+  // Compute min and max expression IDs in a single pass
+  @transient private lazy val minMaxExprId: (Long, Long) = {
+    if (attrs.isEmpty) {
+      (0L, -1L)
+    } else {
+      var min = Long.MaxValue
+      var max = Long.MinValue
+      attrs.foreach { attr =>
+        val id = attr.exprId.id
```

Review Comment:
   One possible reason that it breaks shuffle reuse: previously we used `ExprId` as the hash map key, but now we completely ignore `ExprId#jvmId`. We should record the jvmId and have `def indexOf` return -1 directly if the jvmId is different.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org
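[Editor's note] The reviewer's suggestion (record the jvmId and make `indexOf` bail out with -1 on a mismatch) could be sketched roughly as follows. This is a standalone illustration, not the actual Catalyst code: `ExprId`, `Attribute`, and `AttrSeq` here are simplified stand-ins for the real Spark classes, and the id-to-ordinal map replaces the min/max-range lookup in the PR.

```scala
import java.util.UUID

// Simplified stand-ins for the Catalyst types (hypothetical, for illustration only).
case class ExprId(id: Long, jvmId: UUID)
case class Attribute(name: String, exprId: ExprId)

class AttrSeq(attrs: Seq[Attribute]) {
  // Record the jvmId once; attributes in one sequence are assumed to share it.
  private val jvmId: Option[UUID] = attrs.headOption.map(_.exprId.jvmId)

  // Map the numeric part of each ExprId to its ordinal in `attrs`.
  private val byId: Map[Long, Int] = attrs.map(_.exprId.id).zipWithIndex.toMap

  def indexOf(exprId: ExprId): Int = {
    // Per the review comment: if the jvmId differs, return -1 directly
    // instead of matching on the numeric id alone.
    if (!jvmId.contains(exprId.jvmId)) -1
    else byId.getOrElse(exprId.id, -1)
  }
}
```

With this shape, two `ExprId`s that share a numeric id but come from different JVMs no longer resolve to the same attribute, which is the behavior the `ExprId`-keyed hash map gave before the change.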