noahtaite opened a new issue, #10239: URL: https://github.com/apache/hudi/issues/10239
**Describe the problem you faced**

I have multiple applications reading our 120-table, 1PB+ Hudi OLAP data lake that are seeing gaps of 1hr+ in our application stages when collecting the data:

<img width="1701" alt="image" src="https://github.com/apache/hudi/assets/24283126/a03dd51b-5f0e-4214-a731-2bf81da95926">

Note the 1hr gap between stages 12 and 13.

I have been able to consistently reproduce this in my dev environment and observe the following behaviour:

- Calling `.load()` on the table finishes quickly.
- Calling `.count()` on a specific partition has all jobs in the Spark History Server complete in under 10 minutes, but then a 1hr gap is observed before the output of the count is reported.
- During the gap, my cluster auto-scales down to 1 executor.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a 20TB+ Hudi table with ~250k partitions, metadata table enabled.
2. Load and count a single partition.
3. Observe a large gap during which just a single executor is running.
4. Observe slow read performance.

**Expected behavior**

The count should be reported shortly after all Spark jobs complete, without the 1hr+ gap.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.4.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

I'm just trying to gain a base-level understanding of where this time is going, or for someone to point me in the right direction for troubleshooting. The runtime cost is quite low due to the scaling down, but analytics developers are not happy with their applications slowing down.
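The load/count pattern described above can be sketched as follows. This is a minimal reproduction sketch, not the reporter's exact job: the table path, partition column, and partition value are hypothetical placeholders, and running it requires a Spark cluster with the matching Hudi Spark bundle on the classpath:

```python
# Hypothetical reproduction sketch for the reported gap.
# Assumes: a Hudi table at s3://my-bucket/my_hudi_table (placeholder path)
# partitioned by a column named `dt` (placeholder name), Hudi 0.13.1 / Spark 3.4.0.
import time

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-single-partition-count-repro")
    .getOrCreate()
)

t0 = time.time()
# Per the report, load() itself returns quickly.
df = spark.read.format("hudi").load("s3://my-bucket/my_hudi_table")
t1 = time.time()

# All Spark jobs for this count finish in under 10 minutes, but the
# result is only reported ~1hr later, while the cluster has scaled
# down to a single executor.
n = df.where("dt = '2023-11-01'").count()
t2 = time.time()

print(f"load: {t1 - t0:.1f}s, count: {t2 - t1:.1f}s, rows: {n}")
```

Timing the `load()` and `count()` calls separately, as above, should make it easy to confirm whether the gap sits entirely inside `count()` (i.e. on the driver after the executor-side jobs complete) when filing follow-up details.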
