shahiidiqbal opened a new issue #4839:
URL: https://github.com/apache/hudi/issues/4839
We have a Spark streaming job that ingests real-time data arriving through Kafka
Connect from MongoDB. For some reason the Hudi upserts never trigger compaction:
the partition folders contain thousands of log files but no Parquet files.
The .hoodie folder contains many files, including .commits_.archive, .clean,
.clean.inflight, .clean.requested, .deltacommits, .deltacommits.inflight, and
.deltacommits.requested.
I ran the following CLI commands, but they all return nothing (empty records):
cleans show
compactions show all
show fsview all
Here are the configs we use for Hudi Spark streaming on AWS EMR 6 with Hudi
0.9:
'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
'hoodie.datasource.write.recordkey.field': 'entity_id',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.database': DATABASE_NAME,
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.hive_sync.partition_extractor_class':
'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://url',
'hoodie.datasource.write.payload.class':
'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
'hoodie.datasource.write.precombine.field': 'event_time',
'hoodie.payload.ordering.field': 'event_time',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.index.type' : 'GLOBAL_BLOOM',
'hoodie.upsert.shuffle.parallelism': 2,
'hoodie.insert.shuffle.parallelism': 2,
'hoodie.table.name': TABLE_NAME,
'hoodie.datasource.hive_sync.table': TABLE_NAME,
'hoodie.datasource.write.partitionpath.field': 'partition_date',
'hoodie.datasource.hive_sync.partition_fields': 'partition_date',
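
To make it easier to see what is and is not set, here is a minimal sketch that
collects the options above into one Python dict and checks for compaction-related
keys. The DATABASE_NAME and TABLE_NAME values are placeholders, the helper names
are hypothetical, and the list of compaction keys is only my assumption about
which settings matter here. None of them appear in our configs, so whatever
defaults Hudi 0.9 applies are in effect.

```python
# Placeholder values; the real names are not part of this report.
DATABASE_NAME = "example_db"
TABLE_NAME = "example_table"

# The write options listed above, gathered into a single dict.
hudi_options = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.recordkey.field': 'entity_id',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': DATABASE_NAME,
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://url',
    'hoodie.datasource.write.payload.class':
        'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    'hoodie.datasource.write.precombine.field': 'event_time',
    'hoodie.payload.ordering.field': 'event_time',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.index.type': 'GLOBAL_BLOOM',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.table.name': TABLE_NAME,
    'hoodie.datasource.hive_sync.table': TABLE_NAME,
    'hoodie.datasource.write.partitionpath.field': 'partition_date',
    'hoodie.datasource.hive_sync.partition_fields': 'partition_date',
}

# Compaction-related keys (assumed to be the relevant ones for this problem).
COMPACTION_KEYS = [
    'hoodie.compact.inline',
    'hoodie.compact.inline.max.delta.commits',
    'hoodie.datasource.compaction.async.enable',
]


def explicitly_set(options, keys):
    """Return the subset of `keys` that `options` sets, with their values."""
    return {k: options[k] for k in keys if k in options}


def not_set(options, keys):
    """Return the keys from `keys` that `options` leaves unset."""
    return [k for k in keys if k not in options]


print("explicitly set:", explicitly_set(hudi_options, COMPACTION_KEYS))
print("left at defaults:", not_set(hudi_options, COMPACTION_KEYS))
```

In the job itself the dict would be passed to the writer with
`.options(**hudi_options)`; that part is omitted so the sketch runs without Spark.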
Can anyone help us understand what we are missing to enable compaction, and how
it works? We are unable to get updated data from the read-optimized view (the
_ro table), even after hours.