[
https://issues.apache.org/jira/browse/HUDI-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152766#comment-17152766
]
Adrian Tanase edited comment on HUDI-1079 at 7/7/20, 2:07 PM:
--------------------------------------------------------------
Quick update: I thought it was related to the nullability of the columns, so I
made them all not null:
{noformat}
root
 |-- name: string (nullable = false)
 |-- booksIntersted: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- bookName: string (nullable = false)

2020-07-07 16:58:55,957 [main] INFO org.apache.hudi.HoodieSparkSqlWriter$ - Registered avro schema : {
  "type" : "record",
  "name" : "books_demo_cow_record",
  "namespace" : "hoodie.books_demo_cow",
  "fields" : [ {
    "name" : "name",
    "type" : "string"
  }, {
    "name" : "booksIntersted",
    "type" : {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "booksIntersted",
        "namespace" : "hoodie.books_demo_cow.books_demo_cow_record",
        "fields" : [ {
          "name" : "bookName",
          "type" : "string"
        } ]
      }
    }
  } ]
}
{noformat}
The reader/writer compatibility check in the AvroRecordConverter block
referenced in the description still fails, with this message:
{noformat}
Data encoded using writer schema:
{
  "type" : "record",
  "name" : "array",
  "fields" : [ {
    "name" : "bookName",
    "type" : "string"
  } ]
}
will or may fail to decode using reader schema:
{
  "type" : "record",
  "name" : "booksIntersted",
  "namespace" : "hoodie.books_demo_cow.books_demo_cow_record",
  "fields" : [ {
    "name" : "bookName",
    "type" : "string"
  } ]
}
{noformat}
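Note the record-name mismatch in that message: the writer's element record is
named "array" while the reader's is "booksIntersted". Here is a minimal sketch
that reproduces the incompatibility outside Hudi, assuming the check bottoms
out in Avro's SchemaCompatibility (which the AvroRecordConverter block linked
in the description delegates to):
{code:scala}
import org.apache.avro.{Schema, SchemaCompatibility}

// Writer schema as reported in the error: the array element record is named "array".
val writer = new Schema.Parser().parse(
  """{"type":"record","name":"array","fields":[
    |  {"name":"bookName","type":"string"}
    |]}""".stripMargin)

// Reader schema derived from the table: the same single field, but the record
// carries a different (fully qualified) name.
val reader = new Schema.Parser().parse(
  """{"type":"record","name":"booksIntersted",
    |  "namespace":"hoodie.books_demo_cow.books_demo_cow_record","fields":[
    |  {"name":"bookName","type":"string"}
    |]}""".stripMargin)

// The name mismatch alone is enough to make the pair INCOMPATIBLE, even though
// the fields line up one-to-one.
println(SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType)
{code}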
> Cannot upsert on schema with Array of Record with single field
> --------------------------------------------------------------
>
> Key: HUDI-1079
> URL: https://issues.apache.org/jira/browse/HUDI-1079
> Project: Apache Hudi
> Issue Type: Bug
> Components: Spark Integration
> Affects Versions: 0.5.3
> Environment: spark 2.4.4, local
> Reporter: Adrian Tanase
> Priority: Major
>
> I am trying to trigger upserts on a table that has an array field with
> records of just one field.
> Here is the code to reproduce:
> {code:scala}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructType}
>
> val spark = SparkSession.builder()
>   .master("local[1]")
>   .appName("SparkByExamples.com")
>   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .getOrCreate()
>
> // https://sparkbyexamples.com/spark/spark-dataframe-array-of-struct/
> val arrayStructData = Seq(
>   Row("James", List(Row("Java", "XX", 120), Row("Scala", "XA", 300))),
>   Row("Michael", List(Row("Java", "XY", 200), Row("Scala", "XB", 500))),
>   Row("Robert", List(Row("Java", "XZ", 400), Row("Scala", "XC", 250))),
>   Row("Washington", null)
> )
> val arrayStructSchema = new StructType()
>   .add("name", StringType)
>   .add("booksIntersted", ArrayType(
>     new StructType()
>       .add("bookName", StringType)
>       // .add("author", StringType)
>       // .add("pages", IntegerType)
>   ))
> val df = spark.createDataFrame(
>   spark.sparkContext.parallelize(arrayStructData), arrayStructSchema)
> {code}
> Running an insert followed by an upsert will fail:
> {code:scala}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode.{Append, Overwrite}
>
> // Table name and path assumed for the repro (the table name matches the
> // "hoodie.books_demo_cow" namespace in the registered schema above).
> val tableName = "books_demo_cow"
> val basePath = "file:///tmp/books_demo_cow"
>
> df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Overwrite)
>   .save(basePath)
>
> df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Append) // the upsert on Append is what fails
>   .save(basePath)
> {code}
> If I create the books record with at least two fields, it works as expected
> (see the sketch below).
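> A minimal sketch of that working variant, reusing the commented-out author
> field from the repro above:
> {code:scala}
> // Same repro, but the array element struct now has two fields; with this
> // schema the insert-then-upsert sequence completes as expected.
> val workingSchema = new StructType()
>   .add("name", StringType)
>   .add("booksIntersted", ArrayType(
>     new StructType()
>       .add("bookName", StringType)
>       .add("author", StringType)
>   ))
> {code}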
> Another way to reproduce is to change the generated test data so that the
> tips_history field holds just the amount, dropping the currency; the tests
> then start failing:
>
> [https://github.com/apache/hudi/compare/release-0.5.3...aditanase:avro-arrays-upsert?expand=1]
> I have narrowed this down to this block in the parquet-avro integration:
> [https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L846-L875]
> This block always returns false when deciding whether the reader and writer
> schemas are compatible. Going through that code path makes me think it's
> related to the fields being optional, as the inferred schema seems to be
> (null, string) with default null instead of (string, null) with no default.
> At this point I'm lost; I tried to figure something out based on
> [https://github.com/apache/hudi/pull/1406/files] but I'm not sure where to
> start.
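> To make the (null, string) vs. (string, null) distinction concrete, here is a
> minimal sketch (record and field names are illustrative, not taken from the
> failing table). Avro only lets a field's default value match the first branch
> of its union, so the (null, string) form can carry "default": null while the
> (string, null) form cannot:
> {code:scala}
> import org.apache.avro.Schema
>
> // Optional field with the null branch first: a null default is legal.
> val nullFirst = new Schema.Parser().parse(
>   """{"type":"record","name":"NullFirst","fields":[
>     |  {"name":"bookName","type":["null","string"],"default":null}
>     |]}""".stripMargin)
>
> // Same field with the string branch first: no null default is possible.
> val stringFirst = new Schema.Parser().parse(
>   """{"type":"record","name":"StringFirst","fields":[
>     |  {"name":"bookName","type":["string","null"]}
>     |]}""".stripMargin)
>
> // nullFirst's field carries an explicit null default; stringFirst's has none.
> println(nullFirst.getField("bookName").defaultVal() != null)   // true
> println(stringFirst.getField("bookName").defaultVal() != null) // false
> {code}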
--
This message was sent by Atlassian Jira
(v8.3.4#803005)