[
https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-43201:
---------------------------------
Component/s: SQL
(was: Spark Core)
> Inconsistency between from_avro and from_json function
> ------------------------------------------------------
>
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Philip Adetiloye
> Priority: Major
>
> Spark from_avro function does not allow schema to use dataframe column but
> takes a String schema:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: String): Column {code}
> This makes it impossible to deserialize rows of Avro records with different
> schema since only one schema string could be pass externally.
>
> Here is what I would expect:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
> code example:
> {code:java}
> import org.apache.spark.sql.functions.from_avro
> val avroSchema1 =
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>
> val avroSchema2 =
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
> val df = Seq(
> (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
> (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
> ).toDF("binaryData", "schema")
> val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
> parsed.show()
> // Output:
> // +------------+
> // | parsedData|
> // +------------+
> // |[apple1, 1.0]|
> // |[apple2, 2.0]|
> // +------------+
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]