haoyangeng-db opened a new pull request, #51190: URL: https://github.com/apache/spark/pull/51190
### What changes were proposed in this pull request?

Adds support for accessing fields inside a Variant data type through the colon-sign operator. The syntax is documented here: https://docs.databricks.com/aws/en/sql/language-manual/functions/colonsign

### Why are the changes needed?

Provides a convenient way to access fields inside a Variant via SQL.

### Does this PR introduce _any_ user-facing change?

Yes -- the syntax, which previously threw a `ParseException`, is now supported.

In the Scala Spark shell, before:

```
scala> spark.sql("SELECT PARSE_JSON('{ \"price\": 5 }'):price").collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
org.apache.spark.sql.catalyst.parser.ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near ':'. SQLSTATE: 42601 (line 1, pos 35)

== SQL ==
SELECT PARSE_JSON('{ "price": 5 }'):price
-----------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:274)
  at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:97)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:54)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(AbstractSqlParser.scala:93)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$5(SparkSession.scala:492)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:491)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:490)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:504)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:513)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:91)
  ... 42 elided
```

After:

```
scala> spark.sql("SELECT PARSE_JSON('{ \"price\": 5 }'):price").collect
val res0: Array[org.apache.spark.sql.Row] = Array([5])
```

In the PySpark REPL, before:

```
>>> spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/haoyan.geng/oss-scala/python/pyspark/sql/session.py", line 1810, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
                     ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/Users/haoyan.geng/oss-scala/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
    return_value = get_return_value(
        answer, self.gateway_client, self.target_id, self.name)
  File "/Users/haoyan.geng/oss-scala/python/pyspark/errors/exceptions/captured.py", line 294, in deco
    raise converted from None
pyspark.errors.exceptions.captured.ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near ':'. SQLSTATE: 42601 (line 1, pos 35)

== SQL ==
select parse_json('{ "price": 5 }'):price::int
-----------------------------------^^^
```

After:

```
>>> spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
[Row(price=5)]
```

### How was this patch tested?

- Added new test cases in SQLQueryTestSuite (sql/core/src/test/resources/sql-tests/inputs/variant-field-extractions.sql).
- Manually tested the new behavior in the Spark shell (Scala) and the PySpark REPL.

### Was this patch authored or co-authored using generative AI tooling?

No
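Per the syntax documentation linked above, the colon-sign operator also accepts nested field names and bracketed array indices after the colon. A hedged sketch of that usage (the JSON shape and field names here are illustrative, not taken from the PR's test suite):

```sql
-- Extract a nested field through an array element of a Variant value,
-- then cast the result with the :: operator (illustrative example).
SELECT PARSE_JSON('{ "store": { "fruit": [ { "name": "apple" }, { "name": "pear" } ] } }'):store.fruit[1].name::string;
```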
