chenhao-db opened a new pull request, #45730:
URL: https://github.com/apache/spark/pull/45730

   ### What changes were proposed in this pull request?
   
   In the `Window` node, expressions in both `partitionSpec` and `orderSpec` 
must be orderable, but the current type check only verifies that `orderSpec` 
is orderable. This can cause an error in later optimizer phases.
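
   The idea can be sketched as follows (a simplified stand-in, not the actual 
Spark patch: `DataType`, `isOrderable`, and `checkWindowSpec` below are toy 
versions of Spark's catalyst types and of the `Window` type check):
   
   ```scala
   // Toy model of catalyst data types: map types have no defined ordering.
   sealed trait DataType
   case object LongType extends DataType
   case class MapType(key: DataType, value: DataType) extends DataType
   
   // Stand-in for Spark's RowOrdering.isOrderable.
   def isOrderable(dt: DataType): Boolean = dt match {
     case MapType(_, _) => false
     case _             => true
   }
   
   // Before the fix only the order spec was checked; checking both specs
   // makes `partition by m` (a map column) fail analysis with a clear
   // DATATYPE_MISMATCH error instead of an INTERNAL_ERROR later.
   def checkWindowSpec(
       partitionSpec: Seq[DataType],
       orderSpec: Seq[DataType]): Either[String, Unit] =
     (partitionSpec ++ orderSpec).find(dt => !isOrderable(dt)) match {
       case Some(dt) => Left(s"DATATYPE_MISMATCH.INVALID_ORDERING_TYPE: $dt")
       case None     => Right(())
     }
   ```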
   
   Given a query:
   
   ```
   with t as (select id, map(id, id) as m from range(0, 10))
   select rank() over (partition by m order by id) from t
   ```
   
   Before the PR, it fails with an `INTERNAL_ERROR`:
   
   ```
   org.apache.spark.SparkException: [INTERNAL_ERROR] grouping/join/window 
partition keys cannot be map type. SQLSTATE: XX000
   at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
   at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
   at 
org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.needNormalize(NormalizeFloatingNumbers.scala:103)
   at 
org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.org$apache$spark$sql$catalyst$optimizer$NormalizeFloatingNumbers$$needNormalize(NormalizeFloatingNumbers.scala:94)
   ...
   ```
   
   After the PR, it fails with a `DATATYPE_MISMATCH.INVALID_ORDERING_TYPE` 
error, which is expected:
   
   ```
     org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
[DATATYPE_MISMATCH.INVALID_ORDERING_TYPE] Cannot resolve "m" due to data type 
mismatch: The `attributereference` does not support ordering on type 
"MAP<BIGINT, BIGINT>". SQLSTATE: 42K09; line 2 pos 53;
   Project [RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4]
   +- Project [id#1L, m#0, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS 
FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4, RANK() OVER 
(PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING 
AND CURRENT ROW)#4]
      +- Window [rank(id#1L) windowspecdefinition(m#0, id#1L ASC NULLS FIRST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS RANK() 
OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED 
PRECEDING AND CURRENT ROW)#4], [m#0], [id#1L ASC NULLS FIRST]
         +- Project [id#1L, m#0]
            +- SubqueryAlias t
               +- SubqueryAlias t
                  +- Project [id#1L, map(id#1L, id#1L) AS m#0]
                     +- Range (0, 10, step=1, splits=None)
     at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
   ...
   ```
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

