[ 
https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107561#comment-17107561
 ] 

Vinoth Chandar commented on HUDI-864:
-------------------------------------

Oops, this slipped past my radar.

I prefer not to get into shading parquet. The way we have it right now, we just
use the parquet version on the engine, which makes it easy for users to
troubleshoot issues based on what applies to the engine.

1 & 2 seem good to me. 

> parquet schema conflict: optional binary <some-field> (UTF8) is not a group
> ---------------------------------------------------------------------------
>
>                 Key: HUDI-864
>                 URL: https://issues.apache.org/jira/browse/HUDI-864
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>            Reporter: Roland Johann
>            Priority: Major
>              Labels: bug-bash-0.6.0
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
>     {
>       "name": "categoryResults",
>       "type": {
>         "type": "array",
>         "elementType": {
>           "type": "struct",
>           "fields": [
>             {
>               "name": "categoryId",
>               "type": "string",
>               "nullable": true,
>               "metadata": {}
>             }
>           ]
>         },
>         "containsNull": true
>       },
>       "nullable": true,
>       "metadata": {}
>     }
>   ]
> }
> {code}
> The second ingest batch throws this exception:
> {code}
> ERROR [Executor task launch worker for task 15] 
> commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: operation has failed
>       at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
>       at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
>       at 
> org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
>       at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
>       at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
>       at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
>       at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>       at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>       at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>       at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>       at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>       at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>       at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>       at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>       at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>       at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>       at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>       at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
>       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: operation has failed
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
>       at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:98)
>       ... 34 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: operation has failed
>       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
>       ... 35 more
> Caused by: org.apache.hudi.exception.HoodieException: operation has failed
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueue.throwExceptionIfFailed(BoundedInMemoryQueue.java:227)
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueue.readNextRecord(BoundedInMemoryQueue.java:206)
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueue.access$100(BoundedInMemoryQueue.java:52)
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueue$QueueIterator.hasNext(BoundedInMemoryQueue.java:257)
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:36)
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       ... 3 more
> Caused by: java.lang.ClassCastException: optional binary categoryId (UTF8) is 
> not a group
>       at org.apache.parquet.schema.Type.asGroupType(Type.java:207)
>       at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
>       at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
>       at 
> org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
>       at 
> org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
>       at 
> org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
>       at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
>       at 
> org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
>       at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
>       at 
> org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
>       at 
> org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
>       at 
> org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
>       at 
> org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
>       at 
> org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
>       at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
>       at 
> org.apache.hudi.client.utils.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
>       at 
> org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
>       at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       ... 4 more
> {code}
> Parquet schema of the failing struct:
> {code}
> optional group categoryResults (LIST) {
>   repeated group array {
>     optional binary categoryId (UTF8);
>   }
> }
> {code}
> When the leaf record has multiple fields, the issue goes away. I assume that
> this issue relates to either parquet or avro. The following array-of-struct
> definition is handled fine, without exception:
> {code}
>     optional group productResult (LIST) {
>       repeated group array {
>         optional binary productId (UTF8);
>         optional boolean productImages;
>         optional binary productShortDescription (UTF8);
>       }
>     }
> {code}
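
The single-field versus multi-field difference reported above is consistent with how Parquet's backward-compatibility rules for LIST decide whether the repeated group is the element itself or only a wrapper around it. The stack trace passes through ElementConverter, which suggests the reader treated `array` as a wrapper and `categoryId` as the element; that reading is an assumption, not a confirmed root cause. Below is a minimal sketch of the rules as described in the Parquet format's LogicalTypes document. `PType` and `repeated_is_element` are hypothetical names used for illustration, and this is not the parquet-avro implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PType:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    repetition: str  # "required", "optional", or "repeated"
    fields: List["PType"] = field(default_factory=list)  # empty => primitive

    @property
    def is_group(self) -> bool:
        return bool(self.fields)


def repeated_is_element(list_name: str, repeated: PType) -> bool:
    """Return True if the repeated node is itself the list element
    (legacy 2-level layout), False if it only wraps the element."""
    if not repeated.is_group:
        return True  # rule 1: a repeated primitive is the element
    if len(repeated.fields) > 1:
        return True  # rule 2: a group with several fields is the element
    if repeated.name in ("array", list_name + "_tuple"):
        return True  # rule 3: legacy naming convention marks the element
    return False     # rule 4: otherwise the single child is the element


# The failing schema: one field, so the outcome hinges on the group's name.
single = PType("array", "repeated", [PType("categoryId", "optional")])

# The working schema: several fields, so structure alone settles it.
multi = PType("array", "repeated", [
    PType("productId", "optional"),
    PType("productImages", "optional"),
    PType("productShortDescription", "optional"),
])

# Same one-field shape under a non-legacy name is read as a wrapper.
wrapped = PType("list", "repeated", [PType("categoryId", "optional")])

print(repeated_is_element("categoryResults", single))
print(repeated_is_element("productResult", multi))
print(repeated_is_element("categoryResults", wrapped))
```

Under these rules the one-field group is the only case that depends on naming rather than structure. A reader that resolves that case differently from the writer would try to treat `categoryId` as a group, which matches the shape of the ClassCastException above.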



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
