noahtaite commented on issue #10183: URL: https://github.com/apache/hudi/issues/10183#issuecomment-1828629712
@Cpandey43 Hey! I'm another Hudi 0.13.1 MoR user, so I thought I'd come by to lend a hand and dig a bit deeper into the problem you're reporting. **To be clear - I'm not a Hudi developer.**

**First question** - are you ingesting any updated records, or only new records? From what I understand, the expected behaviour for a MoR table (which I can confirm you are using, based on your configurations + properties) is:

- new records are written to base .parquet files
- updates to records are written to avro .log files

New records need to be associated with a parquet file before updates to them get logged to avro files. As far as I can tell, this is partly so that a read-optimized query will correctly show new records in the data set. Updates, however, will not show up in a read-optimized query until compaction is run against those log files. That is how I've always understood the tradeoff here.

**Second question** - what behaviour do you expect to see with async clustering? It should indeed be a "no-operation" until you asynchronously schedule + execute clustering to stitch together small files / improve query performance by controlling data locality. I suggest taking a look at the [following guide here](https://hudi.apache.org/docs/clustering/) and the [appropriate 0.13.1 configurations](https://hudi.apache.org/docs/0.13.1/configurations) to see what suits your needs:

```
hoodie.clustering.async.enabled
hoodie.clustering.inline
hoodie.clustering.schedule.inline
```

Out of the box these are all disabled, so we wouldn't expect to see any clustering actions in your timeline.

Also, be aware that when you do use clustering, you will see a ".replacecommit" in your Hudi timeline.
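To make the three clustering settings above a bit more concrete, here is a rough Python sketch of how they might be toggled and how you could spot a clustering action afterwards. The helper names `clustering_options` and `completed_replacecommits`, the mode names, and the sample file names in the comments are all made up for illustration; only the three `hoodie.clustering.*` keys and the ".replacecommit" suffix come from the discussion above. Please check the 0.13.1 configuration reference before relying on the exact semantics.

```python
# The three clustering keys mentioned above. All of them are disabled
# out of the box, so "false" is used as the starting point here.
CLUSTERING_DEFAULTS = {
    "hoodie.clustering.inline": "false",
    "hoodie.clustering.schedule.inline": "false",
    "hoodie.clustering.async.enabled": "false",
}


def clustering_options(mode):
    """Hypothetical helper: return the clustering keys for one mode.

    Mode names are illustrative only:
      "off"             - leave everything disabled (the default)
      "inline"          - the writer schedules and executes clustering itself
      "schedule_inline" - the writer only schedules; a separate job executes
      "async"           - scheduling and execution happen asynchronously
    """
    opts = dict(CLUSTERING_DEFAULTS)  # copy so the defaults stay untouched
    if mode == "inline":
        opts["hoodie.clustering.inline"] = "true"
    elif mode == "schedule_inline":
        opts["hoodie.clustering.schedule.inline"] = "true"
    elif mode == "async":
        opts["hoodie.clustering.async.enabled"] = "true"
    elif mode != "off":
        raise ValueError(f"unknown clustering mode: {mode!r}")
    return opts


def completed_replacecommits(timeline_files):
    """Return the file names that end in '.replacecommit'.

    `timeline_files` is assumed to be a plain list of file names taken from
    the table's .hoodie directory. Names ending in '.replacecommit.requested'
    or '.replacecommit.inflight' are deliberately not matched.
    """
    return sorted(name for name in timeline_files if name.endswith(".replacecommit"))
```

With PySpark, a dictionary like this could be passed alongside your existing Hudi writer options, for example through `.options(**clustering_options("inline"))`; the remaining table and record-key options are not shown here and stay whatever you already use.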
