Nitish-sati opened a new issue, #9727:
URL: https://github.com/apache/hudi/issues/9727
I run a Spark Streaming job on an EMR cluster. It reads data from an S3 source,
performs upserts, and writes the result in Hudi format. A separate EMR cluster
runs HBase, which serves as the Hudi index.
Recently, Parquet files in the target Hudi table became corrupted, and one of
the micro-batches failed with an error indicating Parquet file corruption. We
reviewed the YARN container logs for clues about what corrupted the file, but
found nothing relevant.
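
As a first triage step, the outer framing of a suspect file can be checked
offline. The sketch below assumes the file has been copied from S3 to local
disk, and the function name `check_parquet_framing` is hypothetical. It only
verifies the `PAR1` magic at both ends and a plausible footer length. It does
not decode the Thrift `FileMetaData`, so a file can pass this check and still
fail with the error shown in the stack trace.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # unencrypted Parquet files start and end with these 4 bytes


def check_parquet_framing(path):
    """Return (ok, message) for the file's outer Parquet framing only."""
    data_len = Path(path).stat().st_size
    # Smallest possible file: leading magic + 4-byte footer length + trailing magic.
    if data_len < 12:
        return False, f"file too small ({data_len} bytes)"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, 2)  # last 8 bytes: footer length (little-endian uint32) + magic
        tail = f.read(8)
    if head != MAGIC:
        return False, "missing leading PAR1 magic"
    if tail[4:] != MAGIC:
        return False, "missing trailing PAR1 magic (truncated write?)"
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the leading magic and the 8-byte trailer.
    if footer_len == 0 or footer_len > data_len - 12:
        return False, f"implausible footer length {footer_len}"
    return True, f"framing ok, footer is {footer_len} bytes"
```

A file that fails here was most likely truncated or overwritten. A file that
passes needs a real Parquet reader to confirm whether the footer metadata
itself is damaged.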
**To Reproduce**
We have not been able to reproduce the issue.
**Expected behavior**
The job should upsert the latest record for each primary key into the target
location.
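
The expectation can be stated concretely. The sketch below is a plain-Python
illustration of the intended result, not Hudi's implementation. It assumes
"latest" means the largest ordering value (`col_2` in the configs) for each
record key (`col_3`). The function `latest_per_key` and the sample rows are
hypothetical.

```python
def latest_per_key(existing, incoming, key="col_3", ordering="col_2"):
    """Merge two lists of dict rows, keeping the highest-ordering row per key."""
    merged = {}
    for row in list(existing) + list(incoming):
        current = merged.get(row[key])
        # On ties the later row wins, so an incoming row replaces an equal existing one.
        if current is None or row[ordering] >= current[ordering]:
            merged[row[key]] = row
    return list(merged.values())


# Hypothetical rows: key "a" gets a newer version, key "b" receives a stale one.
table = [{"col_3": "a", "col_2": 1, "v": "old"}, {"col_3": "b", "col_2": 5, "v": "keep"}]
batch = [{"col_3": "a", "col_2": 2, "v": "new"}, {"col_3": "b", "col_2": 3, "v": "stale"}]
result = latest_per_key(table, batch)
```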
**Environment Description**
* Hudi version : 0.9
* Spark version : 2.4.8
* Hive version : 2.3.9
* Hadoop version : 2.10.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
*Hudi configs*
> "hoodie.clean.async": "false",
> "hoodie.clean.automatic": "true",
> "hoodie.cleaner.commits.retained": "10",
> "hoodie.cleaner.parallelism": "700",
> "hoodie.consistency.check.enabled": "true",
> "hoodie.datasource.hive_sync.assume_date_partitioning": "false",
> "hoodie.datasource.hive_sync.database": "db_name",
> "hoodie.datasource.hive_sync.enable": "true",
> "hoodie.datasource.hive_sync.jdbcurl": "jdbc_url",
> "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
> "hoodie.datasource.hive_sync.partition_fields": "col_1",
> "hoodie.datasource.hive_sync.table": "hudi_table_name",
> "hoodie.datasource.write.hive_style_partitioning": "true",
> "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
> "hoodie.payload.ordering.field": "col_2",
> "hoodie.datasource.write.operation": "upsert",
> "hoodie.datasource.write.partitionpath.field": "col_1",
> "hoodie.datasource.write.precombine.field": "col_2",
> "hoodie.datasource.write.recordkey.field": "col_3",
> "hoodie.datasource.write.streaming.ignore.failed.batch": "false",
> "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
> "hoodie.avro.schema.validate": "true",
> "hoodie.hbase.index.update.partition.path": "true",
> "hoodie.index.hbase.get.batch.size": "1000",
> "hoodie.index.hbase.max.qps.per.region.server": "1000",
> "hoodie.index.hbase.put.batch.size": "1000",
> "hoodie.index.hbase.qps.allocator.class": "org.apache.hudi.index.hbase.DefaultHBaseQPSResourceAllocator",
> "hoodie.index.hbase.qps.fraction": "0.5f",
> "hoodie.index.hbase.max.qps.fraction": "10000",
> "hoodie.index.hbase.min.qps.fraction": "1000",
> "hoodie.index.hbase.put.batch.size.autocompute": "true",
> "hoodie.index.hbase.rollback.sync": "true",
> "hoodie.index.hbase.table": "hbase_table_name",
> "hoodie.index.hbase.zknode.path": "/hbase",
> "hoodie.index.hbase.zkport": "2181",
> "hoodie.index.hbase.zkquorum": "hbase_cluster_dns",
> "hoodie.index.type": "HBASE",
> "hoodie.memory.compaction.fraction": "0.8",
> "hoodie.parquet.block.size": "152043520",
> "hoodie.parquet.compression.codec": "snappy",
> "hoodie.parquet.max.file.size": "152043520",
> "hoodie.parquet.small.file.limit": "104857600",
> "hoodie.table.name": "hudi_table_name",
> "hoodie.upsert.shuffle.parallelism": "50"
**Stacktrace**
```
23/09/06 05:39:32 INFO S3NativeFileSystem: Opening 's3://**/**/**/**/**.parquet' for reading
23/09/06 05:39:32 ERROR BoundedInMemoryExecutor: error producing records
org.apache.hudi.exception.HoodieIOException: unable to read next record from parquet file
    at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53)
    at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
    at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'name' was not present! Struct: SchemaElement(name:null, field_id:6)
    at org.apache.parquet.format.Util.read(Util.java:216)
    at org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
    at org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:873)
    at org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:870)
    at org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:747)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:870)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:540)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:715)
    at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:603)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
    at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
    ... 8 more
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'name' was not present! Struct: SchemaElement(name:null, field_id:6)
    at org.apache.parquet.format.SchemaElement.validate(SchemaElement.java:1201)
    at org.apache.parquet.format.SchemaElement$SchemaElementStandardScheme.read(SchemaElement.java:1331)
    at org.apache.parquet.format.SchemaElement$SchemaElementStandardScheme.read(SchemaElement.java:1230)
    at org.apache.parquet.format.SchemaElement.read(SchemaElement.java:1105)
    at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1080)
    at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1051)
    at org.apache.parquet.format.FileMetaData.read(FileMetaData.java:945)
    at org.apache.parquet.format.Util.read(Util.java:213)
    ... 19 more
23/09/06 05:39:33 ERROR BoundedInMemoryExecutor: error consuming records
org.apache.hudi.exception.HoodieException: operation has failed
    at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.throwExceptionIfFailed(BoundedInMemoryQueue.java:247)
    at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.readNextRecord(BoundedInMemoryQueue.java:226)
    at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.access$100(BoundedInMemoryQueue.java:52)
    at org.apache.hudi.common.util.queue.BoundedInMemoryQueue$QueueIterator.hasNext(BoundedInMemoryQueue.java:277)
    at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:36)
    at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieIOException: unable to read next record from parquet file
    at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53)
    at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
    at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    ... 4 more
Caused by: java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'name' was not present! Struct: SchemaElement(name:null, field_id:6)
    at org.apache.parquet.format.Util.read(Util.java:216)
    at org.apache.parquet.format.Util.readFileMetaData(Util.java:73)
    at org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:873)
    at org.apache.parquet.format.converter.ParquetMetadataConverter$2.visit(ParquetMetadataConverter.java:870)
    at org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:747)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:870)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:540)
    at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:715)
    at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:603)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
    at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
    ... 8 more
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'name' was not present! Struct: SchemaElement(name:null, field_id:6)
    at org.apache.parquet.format.SchemaElement.validate(SchemaElement.java:1201)
    at org.apache.parquet.format.SchemaElement$SchemaElementStandardScheme.read(SchemaElement.java:1331)
    at org.apache.parquet.format.SchemaElement$SchemaElementStandardScheme.read(SchemaElement.java:1230)
    at org.apache.parquet.format.SchemaElement.read(SchemaElement.java:1105)
    at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1080)
    at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1051)
    at org.apache.parquet.format.FileMetaData.read(FileMetaData.java:945)
    at org.apache.parquet.format.Util.read(Util.java:213)
    ... 19 more
```
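
Both errors above bottom out in the same exception. Below is a minimal sketch
for pulling the innermost `Caused by:` entry out of a trace like this one,
assuming the log text is available as a string with each `Caused by:` header on
a single line. `root_cause` is a hypothetical helper, and the sample is
abbreviated from the trace above.

```python
import re

# Matches a "Caused by:" header and splits it into exception class and message.
CAUSED_BY = re.compile(r"^Caused by: (?P<cls>[\w.$]+):?\s*(?P<msg>.*)$")


def root_cause(trace):
    """Return (exception_class, message) of the last 'Caused by:' entry, or None."""
    found = None
    for line in trace.splitlines():
        match = CAUSED_BY.match(line.strip())
        if match:
            # Later entries are deeper in the chain, so keep overwriting.
            found = (match.group("cls"), match.group("msg").strip())
    return found


sample = """org.apache.hudi.exception.HoodieIOException: unable to read next record from parquet file
Caused by: java.io.IOException: can not read class org.apache.parquet.format.FileMetaData
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'name' was not present!
"""
cause = root_cause(sample)
print(cause)
```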