[jira] [Updated] (SPARK-50656) JDBC Reader Fails to Handle Complex Types (Array, Map) from Trino

Narayan Bhawar (Jira) Tue, 24 Dec 2024 00:00:39 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-50656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Narayan Bhawar updated SPARK-50656:
-----------------------------------
    Description: 
{*}Description{*}:

I am encountering an issue when using Spark to read data from a Trino instance 
via JDBC. Specifically, when querying complex types such as {{ARRAY}} or 
{{MAP}} from Trino, Spark throws an error indicating that it cannot recognize 
these SQL types. Below is the context:

{*}Code Example{*}:

 
{code:java}
val sourceDF = spark.read
  .format("jdbc")
  .option("driver", "io.trino.jdbc.TrinoDriver")
  .option("url", "jdbc:trino://localhost:8181")
  .option("query", "select address from minio.qa.nbcheck1")
  .load(){code}
 

 

*Error Message:*

 
{code:java}
2/04 03:49:59 INFO SparkContext: SparkContext already stopped.
Exception in thread "main" org.apache.spark.SparkSQLException: 
[UNRECOGNIZED_SQL_TYPE] Unrecognized SQL type - name: array (row(city varchar, 
state varchar)), id: ARRAY.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.unrecognizedSqlTypeError(QueryExecutionErrors.scala:992){code}
 

 

{*}Root Cause{*}:

The error seems to be occurring because Spark's JDBC data source does not 
recognize complex SQL types like {{ARRAY}} or {{MAP}} from Trino by default. 
This is confirmed by the following relevant section of Spark's code:

[https://github.com/apache/spark/blob/v3.5.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala]
{code:java}
private def getCatalystType(
  sqlType: Int,
  typeName: String,
  precision: Int,
  scale: Int,
  signed: Boolean,
  isTimestampNTZ: Boolean): DataType = sqlType match {
    ...
    case _ =>
      // For unmatched types:
      // including java.sql.Types.ARRAY, DATALINK, DISTINCT, JAVA_OBJECT, NULL, 
OTHER, REF_CURSOR,
      // TIME_WITH_TIMEZONE, TIMESTAMP_WITH_TIMEZONE, and others.
      val jdbcType = classOf[JDBCType].getEnumConstants()
        .find(_.getVendorTypeNumber == sqlType)
        .map(_.getName)
        .getOrElse(sqlType.toString)
      throw QueryExecutionErrors.unrecognizedSqlTypeError(jdbcType, 
typeName){code}
As you can see, the method for translating JDBC types to Spark Catalyst types 
doesn't currently handle ARRAY or MAP, among other types, leading to the error. 
The JDBC schema translation fails when complex types such as ARRAY or MAP are 
present.

{*}Expected Behavior{*}:

Spark should not fail when encountering complex types like {{ARRAY}} or {{MAP}} 
from a Trino JDBC source. Instead, it should either:
 # Convert these complex types into a serialized string format (e.g., JSON) 
over the wire.
 # Provide an option for users to manually handle such complex types after 
loading them into a DataFrame.

  was:
{*}Description{*}:

I am encountering an issue when using Spark to read data from a Trino instance 
via JDBC. Specifically, when querying complex types such as {{ARRAY}} or 
{{MAP}} from Trino, Spark throws an error indicating that it cannot recognize 
these SQL types. Below is the context:

{*}Code Example{*}:

 
{code:java}
val sourceDF = spark.read
  .format("jdbc")
  .option("driver", "io.trino.jdbc.TrinoDriver")
  .option("url", "jdbc:trino://localhost:8181")
  .option("query", "select address from minio.qa.nbcheck1")
  .load(){code}
 

 

*Error Message:*

 
{code:java}
2/04 03:49:59 INFO SparkContext: SparkContext already stopped.
Exception in thread "main" org.apache.spark.SparkSQLException: 
[UNRECOGNIZED_SQL_TYPE] Unrecognized SQL type - name: array (row(city varchar, 
state varchar)), id: ARRAY.
at 
org.apache.spark.sql.errors.QueryExecutionErrors$.unrecognizedSqlTypeError(QueryExecutionErrors.scala:992){code}
 

 

{*}Root Cause{*}:

The error seems to be occurring because Spark's JDBC data source does not 
recognize complex SQL types like {{ARRAY}} or {{MAP}} from Trino by default. 
This is confirmed by the following relevant section of Spark's code:

https://github.com/apache/spark/blob/v3.5.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
{code:java}
private def getCatalystType(
  sqlType: Int,
  typeName: String,
  precision: Int,
  scale: Int,
  signed: Boolean,
  isTimestampNTZ: Boolean): DataType = sqlType match {
    ...
    case _ =>
      // For unmatched types:
      // including java.sql.Types.ARRAY, DATALINK, DISTINCT, JAVA_OBJECT, NULL, 
OTHER, REF_CURSOR,
      // TIME_WITH_TIMEZONE, TIMESTAMP_WITH_TIMEZONE, and others.
      val jdbcType = classOf[JDBCType].getEnumConstants()
        .find(_.getVendorTypeNumber == sqlType)
        .map(_.getName)
        .getOrElse(sqlType.toString)
      throw QueryExecutionErrors.unrecognizedSqlTypeError(jdbcType, 
typeName){code}
As you can see, the method for translating JDBC types to Spark Catalyst types 
doesn't currently handle ARRAY or MAP, among other types, leading to the error. 
The JDBC schema translation fails when complex types such as ARRAY or MAP are 
present.

{*}Expected Behavior{*}:

Spark should not fail when encountering complex types like {{ARRAY}} or {{MAP}} 
from a Trino JDBC source. Instead, it should either:
 # Convert these complex types into a serialized string format (e.g., JSON) 
over the wire.
 # Provide an option for users to manually handle such complex types after 
loading them into a DataFrame.

{{}}


> JDBC Reader Fails to Handle Complex Types (Array, Map) from Trino
> -----------------------------------------------------------------
>
>                 Key: SPARK-50656
>                 URL: https://issues.apache.org/jira/browse/SPARK-50656
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell
>    Affects Versions: 3.5.4
>         Environment: {*}Environment{*}:
>  * Spark version: 3.5.x 
>  * Trino version: 457
>  * JDBC Driver: {{io.trino.jdbc.TrinoDriver}}
>            Reporter: Narayan Bhawar
>            Priority: Major
>              Labels: Trino, complextype, jdbc, spark
>
> {*}Description{*}:
> I am encountering an issue when using Spark to read data from a Trino 
> instance via JDBC. Specifically, when querying complex types such as 
> {{ARRAY}} or {{MAP}} from Trino, Spark throws an error indicating that it 
> cannot recognize these SQL types. Below is the context:
> {*}Code Example{*}:
>  
> {code:java}
> val sourceDF = spark.read
>   .format("jdbc")
>   .option("driver", "io.trino.jdbc.TrinoDriver")
>   .option("url", "jdbc:trino://localhost:8181")
>   .option("query", "select address from minio.qa.nbcheck1")
>   .load(){code}
>  
>  
> *Error Message:*
>  
> {code:java}
> 2/04 03:49:59 INFO SparkContext: SparkContext already stopped.
> Exception in thread "main" org.apache.spark.SparkSQLException: 
> [UNRECOGNIZED_SQL_TYPE] Unrecognized SQL type - name: array (row(city 
> varchar, state varchar)), id: ARRAY.
> at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.unrecognizedSqlTypeError(QueryExecutionErrors.scala:992){code}
>  
>  
> {*}Root Cause{*}:
> The error seems to be occurring because Spark's JDBC data source does not 
> recognize complex SQL types like {{ARRAY}} or {{MAP}} from Trino by default. 
> This is confirmed by the following relevant section of Spark's code:
> [https://github.com/apache/spark/blob/v3.5.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala]
> {code:java}
> private def getCatalystType(
>   sqlType: Int,
>   typeName: String,
>   precision: Int,
>   scale: Int,
>   signed: Boolean,
>   isTimestampNTZ: Boolean): DataType = sqlType match {
>     ...
>     case _ =>
>       // For unmatched types:
>       // including java.sql.Types.ARRAY, DATALINK, DISTINCT, JAVA_OBJECT, 
> NULL, OTHER, REF_CURSOR,
>       // TIME_WITH_TIMEZONE, TIMESTAMP_WITH_TIMEZONE, and others.
>       val jdbcType = classOf[JDBCType].getEnumConstants()
>         .find(_.getVendorTypeNumber == sqlType)
>         .map(_.getName)
>         .getOrElse(sqlType.toString)
>       throw QueryExecutionErrors.unrecognizedSqlTypeError(jdbcType, 
> typeName){code}
> As you can see, the method for translating JDBC types to Spark Catalyst types 
> doesn't currently handle ARRAY or MAP, among other types, leading to the 
> error. The JDBC schema translation fails when complex types such as ARRAY or 
> MAP are present.
> {*}Expected Behavior{*}:
> Spark should not fail when encountering complex types like {{ARRAY}} or 
> {{MAP}} from a Trino JDBC source. Instead, it should either:
>  # Convert these complex types into a serialized string format (e.g., JSON) 
> over the wire.
>  # Provide an option for users to manually handle such complex types after 
> loading them into a DataFrame.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-50656) JDBC Reader Fails to Handle Complex Types (Array, Map) from Trino

Reply via email to