haoyangeng-db opened a new pull request, #51190:
URL: https://github.com/apache/spark/pull/51190

   ### What changes were proposed in this pull request?
   
   Adds support for accessing fields inside a Variant value through the colon-sign (`:`) operator. The syntax is documented here: https://docs.databricks.com/aws/en/sql/language-manual/functions/colonsign
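   
   To illustrate, here is a minimal sketch of the syntax based on the linked documentation (the table name `events`, the column name `raw`, and the field names are invented for this example; the exact set of supported path forms is defined by the PR itself, not by this sketch):
   ```
   -- Extract a top-level field from a VARIANT column (the result is itself a VARIANT).
   SELECT raw:price FROM events;

   -- Per the linked docs, paths can drill into nested objects and arrays.
   SELECT raw:item.name, raw:tags[0] FROM events;

   -- Combine with a cast (::) to obtain a typed result, as in the examples below.
   SELECT raw:price::int FROM events;
   ```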
   
   ### Why are the changes needed?
   
   Provides a convenient way to access fields inside a Variant via SQL.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes -- the syntax, which previously failed with a ParseException, is now supported.
   
   === In Scala Spark shell:
   
   Before:
   ```
   scala> spark.sql("SELECT PARSE_JSON('{ \"price\": 5 }'):price").collect
   warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
   org.apache.spark.sql.catalyst.parser.ParseException:
   [PARSE_SYNTAX_ERROR] Syntax error at or near ':'. SQLSTATE: 42601 (line 1, pos 35)

   == SQL ==
   SELECT PARSE_JSON('{ "price": 5 }'):price
   -----------------------------------^^^

     at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:274)
     at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:97)
     at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:54)
     at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(AbstractSqlParser.scala:93)
     at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$5(SparkSession.scala:492)
     at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
     at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:491)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
     at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:490)
     at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:504)
     at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:513)
     at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:91)
     ... 42 elided
   ```
   
   After:
   ```
   scala> spark.sql("SELECT PARSE_JSON('{ \"price\": 5 }'):price").collect
   val res0: Array[org.apache.spark.sql.Row] = Array([5])
   ```
   
   === In PySpark REPL:
   
   Before:
   ```
   >>> spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
   Traceback (most recent call last):
     File "<python-input-0>", line 1, in <module>
       spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
       ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/Users/haoyan.geng/oss-scala/python/pyspark/sql/session.py", line 1810, in sql
       return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
                        ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
     File "/Users/haoyan.geng/oss-scala/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
       return_value = get_return_value(
           answer, self.gateway_client, self.target_id, self.name)
     File "/Users/haoyan.geng/oss-scala/python/pyspark/errors/exceptions/captured.py", line 294, in deco
       raise converted from None
   pyspark.errors.exceptions.captured.ParseException:
   [PARSE_SYNTAX_ERROR] Syntax error at or near ':'. SQLSTATE: 42601 (line 1, pos 35)

   == SQL ==
   select parse_json('{ "price": 5 }'):price::int
   -----------------------------------^^^
   ```
   
   After:
   ```
   >>> spark.sql("select parse_json('{ \"price\": 5 }'):price::int").collect()
   [Row(price=5)]
   ```
   
   ### How was this patch tested?
   - Added new test cases in SQLQueryTestSuite (sql/core/src/test/resources/sql-tests/inputs/variant-field-extractions.sql); a sketch of the kind of queries covered follows this list.
   - Manually tested the new behavior in Spark Shell (Scala) and PySpark REPL.
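   
   As a rough illustration only (these queries are not copied from the test file; the field names and values here are invented), the new syntax can be exercised with queries along these lines:
   ```
   -- extract a top-level field from a parsed JSON literal
   SELECT PARSE_JSON('{ "price": 5 }'):price;

   -- chain the extraction with a cast to get a typed result
   SELECT PARSE_JSON('{ "price": 5 }'):price::int;

   -- a nested path, per the linked colon-sign documentation
   SELECT PARSE_JSON('{ "item": { "name": "pen" } }'):item.name;
   ```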
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   

