noahtaite opened a new issue, #10239: URL: https://github.com/apache/hudi/issues/10239
**Describe the problem you faced**

I have multiple applications reading our 120-table, 1PB+ Hudi OLAP data lake that are seeing gaps of 1hr+ in our application stages when collecting the data:

<img width="1701" alt="image" src="https://github.com/apache/hudi/assets/24283126/a03dd51b-5f0e-4214-a731-2bf81da95926">

Note the 1hr gap between stages 12 and 13.

I have been able to consistently reproduce this in my dev environment and observe the following behaviour:

- Calling `.load()` on the table finishes quickly.
- Calling `.count()` on a specific partition has all jobs in the Spark History Server complete in under 10 minutes, but then a 1hr gap is observed before the output of the count is reported.
- During the gap, my cluster auto-scales down to 1 executor.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a 20TB+ Hudi table with ~250k partitions, metadata table enabled.
2. Load and count a single partition.
3. Observe a large gap during which just a single executor is running.
4. Observe slow read performance.

**Expected behavior**

The count should be reported shortly after all Spark jobs complete, without the 1hr+ gap.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.4.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

I'm just trying to gain a base-level understanding of where this time is going, or for someone to point me in the right direction for troubleshooting. The runtime cost is quite low due to the scaling down, but analytics developers are not happy with their applications slowing down.
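The load/count pattern described above can be sketched as follows. This is a minimal reproduction sketch, not the reporter's exact job: the table path, partition column, and partition value are hypothetical placeholders, and running it requires a Spark cluster with the matching Hudi Spark bundle on the classpath:

```python
# Hypothetical reproduction sketch for the reported gap.
# Assumes: a Hudi table at s3://my-bucket/my_hudi_table (placeholder path)
# partitioned by a column named `dt` (placeholder name), Hudi 0.13.1 / Spark 3.4.0.
import time

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-single-partition-count-repro")
    .getOrCreate()
)

t0 = time.time()
# Per the report, load() itself returns quickly.
df = spark.read.format("hudi").load("s3://my-bucket/my_hudi_table")
t1 = time.time()

# All Spark jobs for this count finish in under 10 minutes, but the
# result is only reported ~1hr later, while the cluster has scaled
# down to a single executor.
n = df.where("dt = '2023-11-01'").count()
t2 = time.time()

print(f"load: {t1 - t0:.1f}s, count: {t2 - t1:.1f}s, rows: {n}")
```

Timing the `load()` and `count()` calls separately, as above, should make it easy to confirm whether the gap sits entirely inside `count()` (i.e. on the driver after the executor-side jobs complete) when filing follow-up details.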
