sam-wmt opened a new issue #1782: URL: https://github.com/apache/hudi/issues/1782
**Describe the problem you faced**

We are currently streaming data and upserting into a Merge-On-Read table. The full table will contain a bounded 350M entities, and we expect the total table size to be roughly 10+ TB. For the first 10M records the batches completed quickly, but since then batch time has been growing slowly, by about 1 minute per 5-10 batches. We have enabled inline compaction for testing (INLINE_COMPACT_PROP) with INLINE_COMPACT_NUM_DELTA_COMMITS_PROP set to 12, and are currently running 15-minute batches in Spark.

**Expected behavior**

With Merge-On-Read semantics I would expect most writes to be quick and only the compacting commit to take longer and longer. However, I see no difference in speed across the 12 batches; they all take about the same amount of time.

**Environment Description**

* Hudi version : 0.5.3
* Spark version : 2.4-1.0.5
* Hive version :
* Hadoop version : 2.7
* Storage (HDFS/S3/GCS..) : ADLSv2 and GCS (Spark running in Azure; cross-tested with GCS)
* Running on Docker? (yes/no) : yes and no, both

**Additional context**

cc @christoph-wmt

**Stacktrace**

No exceptions are thrown; batch times are simply degrading.
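For reference, a minimal sketch of the writer options involved, assuming Hudi 0.5.x config keys (the table name and base path are hypothetical, and the `storage.type` key was renamed in later Hudi releases):

```python
# Sketch of Hudi writer options for inline compaction on a Merge-On-Read
# table. Key names follow Hudi 0.5.x; "entities" / base_path are made up.
hudi_options = {
    "hoodie.table.name": "entities",                       # hypothetical table name
    "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    # INLINE_COMPACT_PROP: run compaction inline as part of the write
    "hoodie.compact.inline": "true",
    # INLINE_COMPACT_NUM_DELTA_COMMITS_PROP: compact after N delta commits,
    # so with 15-minute batches a compaction is scheduled roughly every 3 hours
    "hoodie.compact.inline.max.delta.commits": "12",
}

# In a streaming job these would be applied per micro-batch, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

With these settings, 11 of every 12 delta commits should be log appends and only the 12th should pay the compaction cost, which is why uniform batch times across all 12 batches are surprising.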
