[
https://issues.apache.org/jira/browse/HUDI-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887740#comment-17887740
]
Sagar Sumit commented on HUDI-7544:
-----------------------------------
h3. Relevant Code
{{LSMTimeline}} - Defines the LSM tree layout of the timeline.
* Layout Overview
** *LSM Tree Structure:*
*** {*}Layer 0 (L0){*}: The most recent instants (timeline actions) are stored
here in smaller files, one batch at a time.
*** {*}Higher Layers (L1, L2, etc.){*}: When the number of files in L0 exceeds
a threshold (default is 10), a compaction process merges these files into
larger files in L1, and the process repeats for subsequent layers.
*** {*}Manifests{*}: The timeline uses *manifest files* to track the valid
files in each snapshot version. When a new batch is flushed into L0 or a
compaction happens, a new manifest is generated, recording the current valid
files for that snapshot.
*** {*}Version Files{*}: After each manifest file is generated, a *version
file* is created to track the version of the snapshot. Each new snapshot is a
new version of the timeline.
** {*}LSM Timeline Compaction{*}:
*** The LSM tree uses a *universal compaction strategy* where files in L0 are
compacted into higher levels (e.g., L1, L2) when a threshold number of files is
reached. This helps in managing the size of the timeline effectively while
supporting efficient reads.
*** Example: If each file in L0 contains 10 instants, then after compaction,
each file in L1 could contain 100 instants, and each file in L2 could represent
1000 instants, thereby scaling well for large timelines.
** {*}Reader Workflow{*}:
*** The LSM timeline supports *snapshot isolation* for readers and writers. A
reader can read the latest version of the timeline by:
**** Reading the latest *version file* to determine the current snapshot.
**** Accessing the corresponding *manifest file* for the snapshot to get the
list of valid data files.
**** Reading the data from the *parquet files* using optimizations like data
skipping based on timestamp ranges in file names.
** *Clean Strategy:*
*** Implements a cleaning strategy to retain only a limited number of snapshot
versions (default is 3). After a valid compaction, older files are removed, but
at least 3 versions of the snapshot are kept to ensure recovery and rollback
options are available.
*** There is also an *Instant TTL* mechanism to discard instants that are
older than a certain threshold (based on days), ensuring that very old data
does not unnecessarily burden the system.
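The layout above can be sketched as a toy model (illustrative only; the class and field names are hypothetical and this is not Hudi's {{LSMTimeline}} implementation): batches of instants are flushed into L0, and whenever a layer reaches the 10-file threshold its files are merged into one larger file in the next layer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, self-contained model of the LSM timeline layout described
// above (NOT Hudi's actual code): batches of instants are flushed to L0,
// and once a layer accumulates COMPACTION_THRESHOLD files they are merged
// into a single larger file in the next layer.
class LsmTimelineModel {
    static final int COMPACTION_THRESHOLD = 10; // default threshold mentioned above

    // files per layer; each "file" is represented only by its instant count
    final Map<Integer, List<Integer>> layers = new HashMap<>();

    // Flush a batch of instants into L0, then cascade compactions upward.
    void flush(int instantCount) {
        layers.computeIfAbsent(0, k -> new ArrayList<>()).add(instantCount);
        int level = 0;
        while (layers.getOrDefault(level, List.of()).size() >= COMPACTION_THRESHOLD) {
            List<Integer> files = layers.get(level);
            int merged = files.stream().mapToInt(Integer::intValue).sum();
            files.clear();
            layers.computeIfAbsent(level + 1, k -> new ArrayList<>()).add(merged);
            level++;
        }
    }

    int fileCount(int level) {
        return layers.getOrDefault(level, List.of()).size();
    }
}
```

Flushing 100 batches of 10 instants each in this model ends with a single L2 file of 1000 instants, matching the 10x-per-layer scaling in the example above.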
{{LSMTimelineWriter}} - Organizes the timeline files as an LSM tree as
specified in LSMTimeline.
{{CompletionTimeQueryView}} - Maps instant start times to completion times for
completion-time-based file slicing. This may involve lazily loading the LSM
timeline.
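A minimal sketch of that start-time to completion-time mapping, with an eager cache backed by a lazy loader (names and structure are illustrative assumptions, not Hudi's {{CompletionTimeQueryView}} API):

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch of a completion-time view: recent instants are eagerly
// cached in memory; cache misses fall back to a lazy loader, which in Hudi's
// case would read from the archived LSM timeline.
class CompletionTimeView {
    private final TreeMap<String, String> eager = new TreeMap<>(); // startTime -> completionTime
    private final Function<String, Optional<String>> lazyLoader;

    CompletionTimeView(Map<String, String> recent, Function<String, Optional<String>> lazyLoader) {
        this.eager.putAll(recent);
        this.lazyLoader = lazyLoader;
    }

    Optional<String> completionTime(String startTime) {
        String cached = eager.get(startTime);
        if (cached != null) return Optional.of(cached);
        // Cache the lazily loaded result so repeated queries hit memory.
        Optional<String> loaded = lazyLoader.apply(startTime);
        loaded.ifPresent(c -> eager.put(startTime, c));
        return loaded;
    }
}
```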
h3. *Write Access Patterns*
* {*}Commit Writes{*}:
** {*}Frequency{*}: Multiple times during a commit. Each commit operation may
trigger archival of older eligible commits.
** {*}Components{*}: {{LSMTimelineWriter}} (used in
{{{}HoodieTimelineArchiver{}}}) writes archived instants into the LSM timeline
and performs compaction based on thresholds.
** {*}Use Cases{*}: New commits, updates, compaction to higher layers, and
clean-up of old timeline files.
* {*}Compaction Writes{*}:
** {*}Frequency{*}: Once per layer when enough files are accumulated in L0.
** {*}Components{*}: {{LSMTimelineWriter}} compacts timeline files into higher
layers when the number of files in L0 exceeds the threshold.
** {*}Use Cases{*}: Maintaining performance by reducing the number of small
files and improving read efficiency.
* {*}Manifest File Writes{*}:
** {*}Frequency{*}: On every flush to L0 or after compaction to L1 or higher.
** {*}Components{*}: Manifest files keep track of valid data files and
snapshot versions.
** {*}Use Cases{*}: Snapshot isolation, ensuring consistency of timeline
versions.
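The manifest/version protocol behind these writes can be illustrated with a small model (assumed structure, not Hudi's code): each flush or compaction publishes a new manifest listing the currently valid files, then advances the version pointer, so a reader pinned to an older version keeps a consistent view.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the manifest + version-file protocol described above.
// Every flush or compaction writes a new immutable manifest first, then bumps
// the version pointer; readers holding an older version's manifest still see
// a consistent file list (snapshot isolation).
class SnapshotLog {
    final List<List<String>> manifests = new ArrayList<>(); // one manifest per version
    int latestVersion = -1;

    // Publish a new snapshot: write the manifest first, then the version file.
    void publish(List<String> validFiles) {
        manifests.add(List.copyOf(validFiles));
        latestVersion = manifests.size() - 1; // version file points here
    }

    List<String> read(int version) {
        return manifests.get(version);
    }

    List<String> readLatest() {
        return read(latestVersion);
    }
}
```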
h3. *Read Access Patterns*
* {*}Query-Related Reads{*}:
** {*}Frequency{*}: Once per query.
** {*}Components{*}: {{CompletionTimeQueryView}} queries the timeline to map
start times to completion times and check if instants are archived or completed.
** {*}Use Cases{*}: Fetching commit information, ensuring that the correct
timeline view is available.
* {*}Archival Reads{*}:
** {*}Frequency{*}: During compaction and query retrieval.
** {*}Components{*}: {{{}HoodieArchivedTimeline{}}}, fetching older versions
from the archived timeline.
** {*}Use Cases{*}: Data recovery, rollback, time travel, and historical
snapshots.
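One read-side optimization noted above is data skipping based on timestamp ranges in file names. A sketch, assuming an illustrative {{<minInstant>_<maxInstant>_<level>.parquet}} naming scheme (not necessarily Hudi's exact format):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of file-name based data skipping: prune LSM timeline files whose
// encoded instant range cannot overlap the queried range. The naming scheme
// "<minInstant>_<maxInstant>_<level>.parquet" is an assumption for this
// example; zero-padded instant times make lexicographic comparison valid.
class TimelineFileSkipper {
    static boolean mayContain(String fileName, String queryStart, String queryEnd) {
        String[] parts = fileName.split("_");
        String min = parts[0], max = parts[1];
        // Keep the file only if [min, max] overlaps [queryStart, queryEnd].
        return min.compareTo(queryEnd) <= 0 && max.compareTo(queryStart) >= 0;
    }

    static List<String> prune(List<String> files, String queryStart, String queryEnd) {
        return files.stream()
            .filter(f -> mayContain(f, queryStart, queryEnd))
            .collect(Collectors.toList());
    }
}
```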
h3. *Defaults Affecting Performance*
* {{{}config.getTimelineCompactionBatchSize(){}}}:
** This controls the number of files that trigger compaction in L0. Setting
this value too low can result in excessive compaction; setting it too high
might result in too many small files.
* {{{}max file size{}}}:
** The file size limit ({{{}MAX_FILE_SIZE_IN_BYTES{}}}) affects the frequency
of file flushes. The default is set to 1GB.
* {*}Completion Time Query{*}:
** {*}Default{*}: Queries start with an eager load of the last 3 days of
completed instants.
** {*}Effect{*}: By restricting the load to a recent window, query performance
is improved. A query for older instants will lazily load the required data from
the archive.
* {*}Compaction and Cleaning Frequency{*}:
** LSM compaction happens when files exceed a threshold (default 10 files),
but cleaning is triggered after compaction. Testing the balance between
compaction and cleaning frequency is important for maintaining performance
without overwhelming resources.
h3. Test Plan
h4. *1. Performance Testing*
*1.1 Performance Benchmark Overview*
* {*}Goal{*}: Test the performance of the LSM
Timeline under varying workloads and configurations. Benchmark the speed of
writing, reading, and compacting instants, including cases with metadata and
lazy loading from cloud storage.
*1.2 Performance Test Scenarios*
* {*}Test Setup{*}:
** Extend the provided benchmark to work with *cloud storage* (e.g., AWS S3,
GCS).
** Simulate different commit sizes (e.g., small vs large commits), as this
will affect the size and frequency of L0 file flushes and compactions.
** Run the test across different datasets (small, medium, and large).
* {*}Performance Metrics{*}:
** {*}Time to Write Instants{*}: Measure how long it takes to write instants
to the LSM timeline. This includes the time to flush instants to L0, compact to
higher layers, and update manifests.
** {*}Query Latency{*}: Measure the time taken to query the timeline for:
*** Slim instants.
*** Instants with commit metadata.
*** Completion time queries.
** {*}Compaction Time{*}: Measure how long it takes to compact files when L0
reaches the threshold, especially for large datasets.
*1.3 Performance Benchmark Tests*
* {*}Test 1: Basic Read/Write Performance{*}:
** {*}Scenario{*}: Run the benchmark with default settings on a {*}local
filesystem{*}.
** {*}Metrics{*}: Measure the time to write and read back instants, including
reading commit metadata and start times.
** {*}Goal{*}: Establish a baseline for the performance of LSM timeline on
local storage.
* {*}Test 2: Cloud Storage Performance{*}:
** {*}Scenario{*}: Extend the benchmark to run on *cloud storage* (e.g., S3 or
GCS). Benchmark the time taken to write and read instants from cloud storage,
considering network latency and API calls.
** {*}Metrics{*}: Measure the same metrics as Test 1, but also track the
*number of cloud API calls* made for reading/writing data.
** {*}Goal{*}: Verify that LSM timeline performs efficiently in cloud
environments and that the number of cloud API calls is minimized.
* {*}Test 3: Large Dataset Performance{*}:
** {*}Scenario{*}: Run the benchmark on a *large dataset* (e.g., 10M-50M
commits) with cloud storage, focusing on the performance impact of handling a
large number of instants.
** {*}Metrics{*}: Measure the total time to read/write and compact files. Pay
attention to the memory footprint and cloud API usage.
** {*}Goal{*}: Ensure that the LSM timeline remains efficient and scalable for
large datasets without significant degradation in performance.
* {*}Test 4: Lazy Loading Performance{*}:
** {*}Scenario{*}: Benchmark lazy loading behavior by querying for older
instants that are archived (i.e., beyond the initial eagerly-loaded window).
** {*}Metrics{*}: Measure the additional time incurred by querying instants
that require fetching data from the archive. Compare the performance of lazy
loading vs. eagerly loaded instants.
** {*}Goal{*}: Ensure that lazy loading does not introduce significant latency
for older queries.
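The metrics above (write time, query latency, compaction time) can be collected with a simple timing scaffold like the following (illustrative only, not Hudi's benchmark code); the same harness can be pointed at local-filesystem and cloud-storage runs for a side-by-side comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal latency harness: run an operation repeatedly and report average
// and max latency in milliseconds. Warm-up iterations are discarded so JIT
// compilation and cold caches do not skew the measurement.
class LatencyBenchmark {
    // Returns { averageMillis, maxMillis } over the measured iterations.
    static double[] measure(Runnable op, int warmups, int iterations) {
        for (int i = 0; i < warmups; i++) op.run(); // warm up, discard timings
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            op.run();
            samples.add(System.nanoTime() - start);
        }
        double avg = samples.stream().mapToLong(Long::longValue).average().orElse(0) / 1e6;
        double max = samples.stream().mapToLong(Long::longValue).max().orElse(0) / 1e6;
        return new double[] { avg, max };
    }
}
```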
----
h4. *2. Long-Running Stress Testing*
*2.1 Long-Running Test Overview*
* {*}Goal{*}: Simulate long-running workloads
to test the reliability and stability of the LSM timeline under sustained
pressure. This includes verifying that the compaction, cleaning, and manifest
update processes run efficiently over time, without memory leaks or system
degradation.
*2.2 Stress Test Scenarios*
* {*}Test Setup{*}:
** Use a *cloud-based setup* to simulate continuous commit operations over an
extended period (24-48 hours).
** Simulate both *heavy write operations* (e.g., frequent commits) and *mixed
workloads* (read/write/query).
** Ensure the environment is monitored for resource usage (CPU, memory, disk
space).
* {*}Stress Test Metrics{*}:
** {*}Memory Usage{*}: Ensure that memory usage remains stable over time and
that there are no memory leaks.
** {*}Disk Space Usage{*}: Monitor disk space usage to ensure that cleaning
and compaction effectively manage the file system.
** {*}Compaction Performance{*}: Measure how compaction behaves under
continuous writes. Ensure that the system compacts efficiently without
bottlenecks or excessive latency.
** {*}Snapshot Version Maintenance{*}: Verify that only the correct number of
snapshot versions is kept (as per the retention settings).
** {*}Timeline Consistency{*}: Ensure that the LSM timeline remains consistent
across all layers and that no data corruption occurs during long-running
operations.
*2.3 Stress Test Execution*
* {*}Test 1: Continuous Commit Test{*}:
** {*}Scenario{*}: Simulate *continuous commit operations* (e.g., committing
1M+ instants over 24 hours) to verify how the LSM timeline handles sustained
writes.
** {*}Metrics{*}: Track the memory usage, compaction performance, and disk
space usage. Ensure that the system cleans up old files efficiently and that
compaction runs frequently enough to prevent excessive file growth.
** {*}Goal{*}: Ensure that the system can handle long-term continuous commits
without performance degradation or resource exhaustion.
* {*}Test 2: Mixed Workload Test{*}:
** {*}Scenario{*}: Simulate a *mixed workload* of reads, writes, and queries
over a 24-hour period. Include both small and large commit operations to test
how the system handles varied workloads.
** {*}Metrics{*}: Monitor query latency, write performance, and compaction
efficiency over time. Ensure that the system maintains consistent performance
across all operations.
** {*}Goal{*}: Validate that the LSM timeline can support mixed workloads
without becoming a bottleneck.
* {*}Test 3: Snapshot and Cleanup Stability Test{*}:
** {*}Scenario{*}: Run a test that involves frequently writing to the timeline
and verifying that snapshots are correctly maintained and cleaned up. Trigger
regular compaction and cleaning operations.
** {*}Metrics{*}: Verify that only the last N snapshot versions are kept, and
that old snapshots are properly cleaned up. Ensure that manifest files are
updated correctly and that no obsolete files remain in the system.
** {*}Goal{*}: Ensure that the system maintains the correct number of
snapshots and that cleanup runs efficiently without leaving stale data behind.
*2.4 Stress Test Monitoring*
* {*}Resource Monitoring{*}:
** Set up *resource monitoring* for CPU, memory, and disk space to track how
the system behaves under long-running operations.
** Use cloud monitoring tools (e.g., AWS CloudWatch, GCP Monitoring) to track
API usage and response times.
* {*}Failure Handling{*}:
** Simulate *failure scenarios* (e.g., crashing a writer) to test how the
system recovers. Verify that the timeline remains consistent and that recovery
operations (e.g., rollback) work as expected.
** Test for *concurrency issues* by simulating multiple concurrent writers to
ensure that timeline isolation is maintained.
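For the memory checks in these long-running tests, a crude heap sampler along these lines can run alongside the workload (the growth-ratio threshold is an assumption for this sketch; real tests should confirm suspected leaks with heap dumps and profiler data):

```java
import java.util.List;

// Illustrative heap sampler for long-running stress tests: periodically
// record used heap and flag sustained growth as a possible leak signal.
class HeapSampler {
    // Used heap = total allocated heap minus the free portion.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Returns true if the last sample exceeds the first by more than `ratio`
    // (e.g., 0.5 = 50% growth). A crude heuristic, not a proof of a leak.
    static boolean looksLikeLeak(List<Long> samples, double ratio) {
        if (samples.size() < 2) return false;
        return samples.get(samples.size() - 1) > samples.get(0) * (1 + ratio);
    }
}
```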
----
h4. *3. Maintenance & Reliability Testing*
*3.1 Maintenance Test Overview*
* {*}Goal{*}: Ensure that the LSM timeline is
well-maintained over long periods, with no resource exhaustion or memory leaks.
Validate that compaction, cleaning, and manifest updates run at appropriate
intervals and do not cause performance issues.
*3.2 Maintenance Tests*
* {*}Test 1: Memory Leak Detection{*}:
** {*}Scenario{*}: Run a *long-running test* (24 hours) with continuous writes
and compactions to verify if the system leaks memory over time.
** {*}Metrics{*}: Track memory usage throughout the test to detect any signs
of memory growth. Ensure that the system's memory footprint remains stable.
** {*}Goal{*}: Verify that the system does not leak memory and remains stable
over long durations.
* {*}Test 2: Compaction & Cleanup Efficiency{*}:
** {*}Scenario{*}: Ensure that compaction and cleanup operations run
efficiently under sustained load. Test the system’s ability to compact small
files and clean up old files while maintaining performance.
** {*}Metrics{*}: Monitor the time taken for compaction and cleaning
operations. Ensure that these processes do not cause performance degradation
during high load.
** {*}Goal{*}: Validate that compaction and cleaning are efficient and that
the system remains performant over time.
* {*}Test 3: Manifest & Version File Consistency{*}:
** {*}Scenario{*}: Run long-term tests that involve frequent updates to the
manifest and version files to ensure their consistency.
** {*}Metrics{*}: Track the correctness of manifest updates and verify that
version files are updated correctly after compaction and cleaning.
** {*}Goal{*}: Ensure that manifest files and versioning remain consistent
across long-running operations.
> Harden, Stress and Performance test the LSM timeline on cloud storage
> ---------------------------------------------------------------------
>
> Key: HUDI-7544
> URL: https://issues.apache.org/jira/browse/HUDI-7544
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Vinoth Chandar
> Assignee: Sagar Sumit
> Priority: Blocker
> Fix For: 1.0.0
>
> Original Estimate: 16h
> Time Spent: 3h
> Remaining Estimate: 13h
>
> First, we need to summarize the access patterns to the LSM timeline:
> * Who reads/writes from/to it, and at what frequency (i.e., once per query,
> once per table service x, or multiple times in a commit, etc.)
> * Understand the defaults that control performance (e.g., the completion time
> query view loading the last 7 days of the LSM timeline, or something similar)
> * Flag any issues that can cause correctness problems for writes/queries,
> based on the optimizations done/design.
> * Finally, with the same/updated benchmark, run a large LSM timeline and
> ensure it is performant and efficient (in terms of cloud API calls).
> * Ensure the LSM timeline is well-maintained (compaction, etc. runs at the
> right frequency) with a long-running test, and ensure it does not leak memory.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)