prashantwason opened a new pull request #2495:
URL: https://github.com/apache/hudi/pull/2495
## What is the purpose of the pull request
TimelineServer uses Javalin, which is based on Jetty.
By default, Jetty:
- Has 200 threads
- Compresses output with gzip
- Handles each request sequentially
On a large-scale HUDI dataset (2000 partitions), operations slow down when
TimelineServer is enabled, for the following reasons:
- The driver process usually has only a few cores. 200 Jetty threads cause
heavy contention when hundreds of executors connect to the server in parallel.
- To handle a large number of requests in parallel, it's better to handle each
HTTP request asynchronously using Futures, which Javalin supports.
- The compute overhead of gzipping may not be necessary when the executors
and driver are in the same rack or within the same datacenter.
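
The first two points can be illustrated with a minimal, stdlib-only sketch. It is not the TimelineServer or Javalin code from this PR; the class and method names (`AsyncRequestSketch`, `handle`, `serveAll`) are hypothetical. It assumes a worker pool sized to the available cores rather than 200 threads, with each request returning a `CompletableFuture` so the caller never blocks on a single request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; not Hudi's TimelineServer implementation.
class AsyncRequestSketch {

    // Schedule one request on the shared bounded pool and return a future right away.
    static CompletableFuture<String> handle(int requestId, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> "response-" + requestId, pool);
    }

    // Submit `requests` requests to a pool of `threads` workers and
    // collect the responses in submission order.
    static List<String> serveAll(int requests, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                pending.add(handle(i, pool));
            }
            List<String> responses = new ArrayList<>();
            for (CompletableFuture<String> future : pending) {
                responses.add(future.join()); // wait for each result
            }
            return responses;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Assumption: one worker per core instead of a fixed 200 threads.
        int threads = Runtime.getRuntime().availableProcessors();
        List<String> responses = serveAll(1024, threads);
        System.out.println(responses.size() + " requests served by " + threads + " threads");
    }
}
```

The pool size is a parameter here so the effect of different thread counts can be compared on a given driver.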
## Brief change log
Added settings to control the number of threads created, whether to gzip
output, and whether to process requests asynchronously.
With all the settings enabled, a driver process with 8 cores is able to
handle 1024 executors in parallel on a table with 2000 partitions (CLEAN
operation, which lists all partitions). The time per API request also
dropped from 800 ms to 60 ms.
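
As a rough sketch of how three such settings might be grouped, the stdlib-only example below bundles a thread count, a gzip toggle, and an async toggle, and applies the gzip toggle to a response body. The class, field, and method names are hypothetical and do not reflect the configuration keys or classes this PR actually adds.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical settings holder; names are illustrative, not Hudi's actual config.
class ServerSettingsSketch {
    final int numThreads;          // size of the request-handling pool
    final boolean compressOutput;  // gzip response bodies when true
    final boolean asyncRequests;   // handle requests via futures when true

    ServerSettingsSketch(int numThreads, boolean compressOutput, boolean asyncRequests) {
        this.numThreads = numThreads;
        this.compressOutput = compressOutput;
        this.asyncRequests = asyncRequests;
    }

    // Encode a response body, gzipping it only when compression is enabled.
    byte[] encode(String body) {
        byte[] raw = body.getBytes(StandardCharsets.UTF_8);
        if (!compressOutput) {
            return raw; // skip the CPU cost of gzip, e.g. within one datacenter
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String body = "partition-path\n".repeat(2000);
        ServerSettingsSketch plain = new ServerSettingsSketch(8, false, true);
        ServerSettingsSketch gzipped = new ServerSettingsSketch(8, true, true);
        System.out.println("plain bytes:   " + plain.encode(body).length);
        System.out.println("gzipped bytes: " + gzipped.encode(body).length);
    }
}
```

Disabling compression trades larger payloads for less CPU work on the driver, which matches the reasoning given for executors and driver in the same rack or datacenter.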
## Verify this pull request
This pull request is already covered by existing tests, such as
TimelineServer tests and integration tests.
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.