shahiidiqbal opened a new issue #4839:
URL: https://github.com/apache/hudi/issues/4839


   We have a Spark Streaming job that ingests real-time data coming from MongoDB through Kafka Connect. Somehow Hudi upserts never trigger compaction: the partition folders contain thousands of log files but no parquet files. The .hoodie folder contains lots of files, including .commits_.archive, .clean, .clean.inflight, .clean.requested, .deltacommit, .deltacommit.inflight, and .deltacommit.requested.
   
   I used the following CLI commands, but they all show nothing / empty results:
   cleans show
   compactions show all
   show fsview all
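   
   For completeness, this is roughly how we run them from the Hudi CLI (the base path below is a placeholder, not our real table location):
   
        connect --path s3://<bucket>/<table-base-path>
        cleans show
        compactions show all
        show fsview all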
   
   Here are the configs we use for Hudi Spark Streaming on AWS EMR 6 with Hudi 0.9:
   
        'hoodie.table.name': TABLE_NAME,
        'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.recordkey.field': 'entity_id',
        'hoodie.datasource.write.partitionpath.field': 'partition_date',
        'hoodie.datasource.write.precombine.field': 'event_time',
        'hoodie.payload.ordering.field': 'event_time',
        'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.index.type': 'GLOBAL_BLOOM',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.datasource.hive_sync.database': DATABASE_NAME,
        'hoodie.datasource.hive_sync.table': TABLE_NAME,
        'hoodie.datasource.hive_sync.partition_fields': 'partition_date',
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
        'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://url',
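   
   We have not set any compaction-related options explicitly. As far as we understand, these are the relevant knobs for a MERGE_ON_READ table; the values below are only what we believe the defaults to be, not something we have configured:
   
        # Assumed compaction defaults (not set in our job; please correct if wrong).
        compaction_defaults = {
            # run compaction synchronously as part of each write
            'hoodie.compact.inline': 'false',
            # number of delta commits to accumulate before a compaction is scheduled
            'hoodie.compact.inline.max.delta.commits': 5,
            # async compaction for the Spark streaming sink
            'hoodie.datasource.compaction.async.enable': 'true',
        }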
   
   Can anyone help us understand what we are missing to enable compaction, and how compaction is supposed to work here? We are unable to get updated data from the read-optimized (_ro) table even after hours.
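   
   In case it matters, this is roughly how we query the Hive-synced tables from Spark (assuming spark is our SparkSession and DATABASE_NAME / TABLE_NAME are the same variables as in the config above):
   
        # Read-optimized view: only compacted base (parquet) files, so it lags until compaction runs.
        ro_df = spark.sql(f"SELECT * FROM {DATABASE_NAME}.{TABLE_NAME}_ro")
   
        # Real-time view: merges base files with log files at query time.
        rt_df = spark.sql(f"SELECT * FROM {DATABASE_NAME}.{TABLE_NAME}_rt")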
   
   

