phillycoder opened a new issue, #5985:
URL: https://github.com/apache/hudi/issues/5985
**Describe the problem you faced**
Getting a `java.lang.ClassCastException: optional binary xx (STRING)`
exception when a record gets updated.
The issue happens specifically when a field is an array of structs with a
single field; it does not happen when the array of structs has more than
one field.
**To Reproduce**
Steps to reproduce the behavior:
Launch spark-shell with the Hudi and spark-avro bundles:
```
./spark-shell \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --jars $HOME/.m2/repository/org/apache/hudi/hudi-spark3-bundle_2.12/0.10.0/hudi-spark3-bundle_2.12-0.10.0.jar,$HOME/.m2/repository/org/apache/spark/spark-avro_2.12/3.2.0/spark-avro_2.12-3.2.0.jar
```
```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val tableName = "hudi_cow"
val basePath = "/tmp/hudi_cow"

// Note: valObjs is an array of structs with a single field -- this is the
// shape that triggers the exception on update
val schema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("valObjs", ArrayType(StructType(Array(
    StructField("id", StringType)
  ))))
))

val data1 = Seq(
  Row("row_1", 0L, "test", Array()),
  Row("row_2", 0L, "test", Array()),
  Row("row_3", 0L, "test", Array()))

val dfFromData1 = spark.createDataFrame(data1, schema)
dfFromData1.printSchema
dfFromData1.show

// Initial insert
dfFromData1.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
  option(RECORDKEY_FIELD_OPT_KEY, "rowId").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

val snapshotDF1 = spark.read.format("hudi").load(basePath + "/*")
snapshotDF1.createOrReplaceTempView("hudi_snapshot")
spark.sql("select rowId, preComb, name from hudi_snapshot").show()

// Second save (an update of the same records) triggers the exception
dfFromData1.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
  option(RECORDKEY_FIELD_OPT_KEY, "rowId").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```
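To investigate, the physical Parquet schema Hudi wrote for valObjs can be inspected directly from a file footer, since the failure below happens while parquet-avro re-reads that file. A minimal sketch using the standard parquet-hadoop API (the part-file path is a placeholder, not from this report; pick a real `.parquet` file found under basePath):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder path: substitute an actual part file from the Hudi table
val inputFile = HadoopInputFile.fromPath(
  new Path("/tmp/hudi_cow/<partition>/<part-file>.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
// Prints the physical schema, including how valObjs is encoded as a LIST
println(reader.getFooter.getFileMetaData.getSchema)
reader.close()
```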
When updating the records (the second save), Hudi throws:
```
22/06/27 14:56:41 ERROR BoundedInMemoryExecutor: error producing records
org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54)
	at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassCastException: optional binary id (STRING) is not a group
	at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
	at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
	at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:185)
	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
	... 8 more
```
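One plausible explanation (an assumption on my part, not confirmed in this report): the `Caused by` line points at parquet-avro's list handling, and Parquet's legacy two-level list encoding is known to be ambiguous when the repeated group contains exactly one field -- the reader cannot tell whether that group is the list element itself or just a wrapper around it. Sketching what the written schema for valObjs would look like under that encoding:

```
// Legacy 2-level LIST encoding (illustrative sketch, not dumped from the
// actual files). With only one field inside the repeated group, a reader
// may treat "id" itself as the element; "optional binary id (STRING)"
// then fails the asGroupType() cast seen in the stack trace.
optional group valObjs (LIST) {
  repeated group array {
    optional binary id (UTF8);
  }
}
```

With two fields in the element struct, the wrapper interpretation is the only consistent one, which would explain why the secondid variant below works.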
The weird thing is that if the array-of-structs field (valObjs) has more
than one field, the update works.
For example, the following schema works with the example above; note the
added secondid field in valObjs:
```
val schema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("valObjs", ArrayType(StructType(Array(
    StructField("id", StringType),
    StructField("secondid", StringType)
  ))))
))
```
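If the cause is indeed the legacy list encoding, one workaround worth trying (again an assumption to verify, not something confirmed in this thread) is asking parquet-avro to write the modern three-level list structure via its `parquet.avro.write-old-list-structure` setting, passed through Spark's Hadoop-conf prefix at launch:

```
./spark-shell \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --conf "spark.hadoop.parquet.avro.write-old-list-structure=false" \
  --jars $HOME/.m2/repository/org/apache/hudi/hudi-spark3-bundle_2.12/0.10.0/hudi-spark3-bundle_2.12-0.10.0.jar,$HOME/.m2/repository/org/apache/spark/spark-avro_2.12/3.2.0/spark-avro_2.12-3.2.0.jar
```

Note that files already written with the old structure would likely need to be rewritten for this to help.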
**Expected behavior**
Update should work.
**Environment Description**
* Hudi version : 0.10.1 & 0.11.1
* Spark version : 3.2.0
* Hadoop version : 2.7
* Storage (HDFS/S3/GCS..) : Tested with a local spark-shell and on EMR
* Running on Docker? (yes/no) : no (local macOS; same error on EMR 6.6.0)