[
https://issues.apache.org/jira/browse/HUDI-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887740#comment-17887740
]
Sagar Sumit commented on HUDI-7544:
-----------------------------------
h3. Relevant Code
{{LSMTimeline}} - Defines the LSM tree layout of the timeline.
* Layout Overview
** *LSM Tree Structure:*
*** {*}Layer 0 (L0){*}: The most recent instants (timeline actions) are stored
here in smaller files, one batch at a time.
*** {*}Higher Layers (L1, L2, etc.){*}: When the number of files in L0 exceeds
a threshold (default is 10), a compaction process merges these files into
larger files in L1, and the process repeats for subsequent layers.
*** {*}Manifests{*}: The timeline uses *manifest files* to track the valid
files in each snapshot version. When a new batch is flushed into L0 or a
compaction happens, a new manifest is generated, recording the current valid
files for that snapshot.
*** {*}Version Files{*}: After each manifest file is generated, a *version
file* is created to track the version of the snapshot. Each new snapshot is a
new version of the timeline.
** {*}LSM Timeline Compaction{*}:
*** The LSM tree uses a *universal compaction strategy* where files in L0 are
compacted into higher levels (e.g., L1, L2) when a threshold number of files is
reached. This helps in managing the size of the timeline effectively while
supporting efficient reads.
*** Example: If each file in L0 contains 10 instants, then after compaction,
each file in L1 could contain 100 instants, and each file in L2 could represent
1000 instants, thereby scaling well for large timelines.
** {*}Reader Workflow{*}:
*** The LSM timeline supports *snapshot isolation* for readers and writers. A
reader can read the latest version of the timeline by:
**** Reading the latest *version file* to determine the current snapshot.
**** Accessing the corresponding *manifest file* for the snapshot to get the
list of valid data files.
**** Reading the data from the *parquet files* using optimizations like data
skipping based on timestamp ranges in file names.
** *Clean Strategy:*
*** Implements a cleaning strategy to retain only a limited number of snapshot
versions (default is 3). After a valid compaction, older files are removed, but
at least 3 versions of the snapshot are kept to ensure recovery and rollback
options are available.
*** There is also an *Instant TTL* mechanism to discard instants that are
older than a certain threshold (based on days), ensuring that very old data
does not unnecessarily burden the system.
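The layout above can be sketched as a toy model (illustrative only; the class and field names are hypothetical and this is not Hudi's {{LSMTimeline}} implementation): batches of instants are flushed into L0, and whenever a layer reaches the 10-file threshold its files are merged into one larger file in the next layer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, self-contained model of the LSM timeline layout described
// above (NOT Hudi's actual code): batches of instants are flushed to L0,
// and once a layer accumulates COMPACTION_THRESHOLD files they are merged
// into a single larger file in the next layer.
class LsmTimelineModel {
    static final int COMPACTION_THRESHOLD = 10; // default threshold mentioned above

    // files per layer; each "file" is represented only by its instant count
    final Map<Integer, List<Integer>> layers = new HashMap<>();

    // Flush a batch of instants into L0, then cascade compactions upward.
    void flush(int instantCount) {
        layers.computeIfAbsent(0, k -> new ArrayList<>()).add(instantCount);
        int level = 0;
        while (layers.getOrDefault(level, List.of()).size() >= COMPACTION_THRESHOLD) {
            List<Integer> files = layers.get(level);
            int merged = files.stream().mapToInt(Integer::intValue).sum();
            files.clear();
            layers.computeIfAbsent(level + 1, k -> new ArrayList<>()).add(merged);
            level++;
        }
    }

    int fileCount(int level) {
        return layers.getOrDefault(level, List.of()).size();
    }
}
```

Flushing 100 batches of 10 instants each in this model ends with a single L2 file of 1000 instants, matching the 10x-per-layer scaling in the example above.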
{{LSMTimelineWriter}} - Organizes the timeline files as an LSM tree as
specified in LSMTimeline.
{{CompletionTimeQueryView}} - Maps instant start times to completion times for
completion-time-based file slicing. This may involve lazily loading the LSM
timeline.
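A minimal sketch of that start-time to completion-time mapping, with an eager cache backed by a lazy loader (names and structure are illustrative assumptions, not Hudi's {{CompletionTimeQueryView}} API):

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch of a completion-time view: recent instants are eagerly
// cached in memory; cache misses fall back to a lazy loader, which in Hudi's
// case would read from the archived LSM timeline.
class CompletionTimeView {
    private final TreeMap<String, String> eager = new TreeMap<>(); // startTime -> completionTime
    private final Function<String, Optional<String>> lazyLoader;

    CompletionTimeView(Map<String, String> recent, Function<String, Optional<String>> lazyLoader) {
        this.eager.putAll(recent);
        this.lazyLoader = lazyLoader;
    }

    Optional<String> completionTime(String startTime) {
        String cached = eager.get(startTime);
        if (cached != null) return Optional.of(cached);
        // Cache the lazily loaded result so repeated queries hit memory.
        Optional<String> loaded = lazyLoader.apply(startTime);
        loaded.ifPresent(c -> eager.put(startTime, c));
        return loaded;
    }
}
```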
h3. *Write Access Patterns*
* {*}Commit Writes{*}:
** {*}Frequency{*}: Multiple times during a commit. Each commit operation may
trigger archival of older eligible commits.
** {*}Components{*}: {{LSMTimelineWriter}} (used in
{{{}HoodieTimelineArchiver{}}}) writes archived instants into the LSM timeline
and performs compaction based on thresholds.
** {*}Use Cases{*}: New commits, updates, compaction to higher layers, and
clean-up of old timeline files.
* {*}Compaction Writes{*}:
** {*}Frequency{*}: Once per layer when enough files are accumulated in L0.
** {*}Components{*}: {{LSMTimelineWriter}} compacts timeline files into higher
layers when the number of files in L0 exceeds the threshold.
** {*}Use Cases{*}: Maintaining performance by reducing the number of small
files and improving read efficiency.
* {*}Manifest File Writes{*}:
** {*}Frequency{*}: On every flush to L0 or after compaction to L1 or higher.
** {*}Components{*}: Manifest files keep track of valid data files and
snapshot versions.
** {*}Use Cases{*}: Snapshot isolation, ensuring consistency of timeline
versions.
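The manifest/version protocol behind these writes can be illustrated with a small model (assumed structure, not Hudi's code): each flush or compaction publishes a new manifest listing the currently valid files, then advances the version pointer, so a reader pinned to an older version keeps a consistent view.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the manifest + version-file protocol described above.
// Every flush or compaction writes a new immutable manifest first, then bumps
// the version pointer; readers holding an older version's manifest still see
// a consistent file list (snapshot isolation).
class SnapshotLog {
    final List<List<String>> manifests = new ArrayList<>(); // one manifest per version
    int latestVersion = -1;

    // Publish a new snapshot: write the manifest first, then the version file.
    void publish(List<String> validFiles) {
        manifests.add(List.copyOf(validFiles));
        latestVersion = manifests.size() - 1; // version file points here
    }

    List<String> read(int version) {
        return manifests.get(version);
    }

    List<String> readLatest() {
        return read(latestVersion);
    }
}
```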
h3. *Read Access Patterns*
* {*}Query-Related Reads{*}:
** {*}Frequency{*}: Once per query.
** {*}Components{*}: {{CompletionTimeQueryView}} queries the timeline to map
start times to completion times and check if instants are archived or completed.
** {*}Use Cases{*}: Fetching commit information, ensuring that the correct
timeline view is available.
* {*}Archival Reads{*}:
** {*}Frequency{*}: During compaction and query retrieval.
** {*}Components{*}: {{{}HoodieArchivedTimeline{}}}, fetching older versions
from the archived timeline.
** {*}Use Cases{*}: Data recovery, rollback, time travel, and historical
snapshots.
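One read-side optimization noted above is data skipping based on timestamp ranges in file names. A sketch, assuming an illustrative {{<minInstant>_<maxInstant>_<level>.parquet}} naming scheme (not necessarily Hudi's exact format):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of file-name based data skipping: prune LSM timeline files whose
// encoded instant range cannot overlap the queried range. The naming scheme
// "<minInstant>_<maxInstant>_<level>.parquet" is an assumption for this
// example; zero-padded instant times make lexicographic comparison valid.
class TimelineFileSkipper {
    static boolean mayContain(String fileName, String queryStart, String queryEnd) {
        String[] parts = fileName.split("_");
        String min = parts[0], max = parts[1];
        // Keep the file only if [min, max] overlaps [queryStart, queryEnd].
        return min.compareTo(queryEnd) <= 0 && max.compareTo(queryStart) >= 0;
    }

    static List<String> prune(List<String> files, String queryStart, String queryEnd) {
        return files.stream()
            .filter(f -> mayContain(f, queryStart, queryEnd))
            .collect(Collectors.toList());
    }
}
```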
h3. *Defaults Affecting Performance*
* {{{}config.getTimelineCompactionBatchSize(){}}}:
** This controls the number of files that trigger compaction in L0. Setting
this value too low can result in excessive compaction; setting it too high
might result in too many small files.
* {{{}max file size{}}}:
** The file size limit ({{{}MAX_FILE_SIZE_IN_BYTES{}}}) affects the frequency
of file flushes. The default is set to 1GB.
* {*}Completion Time Query{*}:
** {*}Default{*}: Queries start with an eager load of the last 3 days of
completed instants.
** {*}Effect{*}: By restricting the load to a recent window, query performance
is improved. A query for older instants will lazily load the required data from
the archive.
* {*}Compaction and Cleaning Frequency{*}:
** LSM compaction happens when files exceed a threshold (default 10 files),
but cleaning is triggered after compaction. Testing the balance between
compaction and cleaning frequency is important for maintaining performance
without overwhelming resources.
h3. Test Plan
h4. *1. Performance Testing*
*1.1 Performance Benchmark Overview*
* {*}Goal{*}: Test the performance of the LSM
Timeline under varying workloads and configurations. Benchmark the speed of
writing, reading, and compacting instants, including cases with metadata and
lazy loading from cloud storage.
*1.2 Performance Test Scenarios*
* {*}Test Setup{*}:
** Extend the provided benchmark to work with *cloud storage* (e.g., AWS S3,
GCS).
** Simulate different commit sizes (e.g., small vs large commits), as this
will affect the size and frequency of L0 file flushes and compactions.
** Run the test across different datasets (small, medium, and large).
* {*}Performance Metrics{*}:
** {*}Time to Write Instants{*}: Measure how long it takes to write instants
to the LSM timeline. This includes the time to flush instants to L0, compact to
higher layers, and update manifests.
** {*}Query Latency{*}: Measure the time taken to query the timeline for:
*** Slim instants.
*** Instants with commit metadata.
*** Completion time queries.
** {*}Compaction Time{*}: Measure how long it takes to compact files when L0
reaches the threshold, especially for large datasets.
*1.3 Performance Benchmark Tests*
* {*}Test 1: Basic Read/Write Performance{*}:
** {*}Scenario{*}: Run the benchmark with default settings on a {*}local
filesystem{*}.
** {*}Metrics{*}: Measure the time to write and read back instants, including
reading commit metadata and start times.
** {*}Goal{*}: Establish a baseline for the performance of LSM timeline on
local storage.
* {*}Test 2: Cloud Storage Performance{*}:
** {*}Scenario{*}: Extend the benchmark to run on *cloud storage* (e.g., S3 or
GCS). Benchmark the time taken to write and read instants from cloud storage,
considering network latency and API calls.
** {*}Metrics{*}: Measure the same metrics as Test 1, but also track the
*number of cloud API calls* made for reading/writing data.
** {*}Goal{*}: Verify that LSM timeline performs efficiently in cloud
environments and that the number of cloud API calls is minimized.
* {*}Test 3: Large Dataset Performance{*}:
** {*}Scenario{*}: Run the benchmark on a *large dataset* (e.g., 10M-50M
commits) with cloud storage, focusing on the performance impact of handling a
large number of instants.
** {*}Metrics{*}: Measure the total time to read/write and compact files. Pay
attention to the memory footprint and cloud API usage.
** {*}Goal{*}: Ensure that the LSM timeline remains efficient and scalable for
large datasets without significant degradation in performance.
* {*}Test 4: Lazy Loading Performance{*}:
** {*}Scenario{*}: Benchmark lazy loading behavior by querying for older
instants that are archived (i.e., beyond the initial eagerly-loaded window).
** {*}Metrics{*}: Measure the additional time incurred by querying instants
that require fetching data from the archive. Compare the performance of lazy
loading vs. eagerly loaded instants.
** {*}Goal{*}: Ensure that lazy loading does not introduce significant latency
for older queries.
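The metrics above (write time, query latency, compaction time) can be collected with a simple timing scaffold like the following (illustrative only, not Hudi's benchmark code); the same harness can be pointed at local-filesystem and cloud-storage runs for a side-by-side comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal latency harness: run an operation repeatedly and report average
// and max latency in milliseconds. Warm-up iterations are discarded so JIT
// compilation and cold caches do not skew the measurement.
class LatencyBenchmark {
    // Returns { averageMillis, maxMillis } over the measured iterations.
    static double[] measure(Runnable op, int warmups, int iterations) {
        for (int i = 0; i < warmups; i++) op.run(); // warm up, discard timings
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            op.run();
            samples.add(System.nanoTime() - start);
        }
        double avg = samples.stream().mapToLong(Long::longValue).average().orElse(0) / 1e6;
        double max = samples.stream().mapToLong(Long::longValue).max().orElse(0) / 1e6;
        return new double[] { avg, max };
    }
}
```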
----
h4. *2. Long-Running Stress Testing*
*2.1 Long-Running Test Overview*
* {*}Goal{*}: Simulate long-running workloads
to test the reliability and stability of the LSM timeline under sustained
pressure. This includes verifying that the compaction, cleaning, and manifest
update processes run efficiently over time, without memory leaks or system
degradation.
*2.2 Stress Test Scenarios*
* {*}Test Setup{*}:
** Use a *cloud-based setup* to simulate continuous commit operations over an
extended period (24-48 hours).
** Simulate both *heavy write operations* (e.g., frequent commits) and *mixed
workloads* (read/write/query).
** Ensure the environment is monitored for resource usage (CPU, memory, disk
space).
* {*}Stress Test Metrics{*}:
** {*}Memory Usage{*}: Ensure that memory usage remains stable over time and
that there are no memory leaks.
** {*}Disk Space Usage{*}: Monitor disk space usage to ensure that cleaning
and compaction effectively manage the file system.
** {*}Compaction Performance{*}: Measure how compaction behaves under
continuous writes. Ensure that the system compacts efficiently without
bottlenecks or excessive latency.
** {*}Snapshot Version Maintenance{*}: Verify that only the correct number of
snapshot versions is kept (as per the retention settings).
** {*}Timeline Consistency{*}: Ensure that the LSM timeline remains consistent
across all layers and that no data corruption occurs during long-running
operations.
*2.3 Stress Test Execution*
* {*}Test 1: Continuous Commit Test{*}:
** {*}Scenario{*}: Simulate *continuous commit operations* (e.g., committing
1M+ instants over 24 hours) to verify how the LSM timeline handles sustained
writes.
** {*}Metrics{*}: Track the memory usage, compaction performance, and disk
space usage. Ensure that the system cleans up old files efficiently and that
compaction runs frequently enough to prevent excessive file growth.
** {*}Goal{*}: Ensure that the system can handle long-term continuous commits
without performance degradation or resource exhaustion.
* {*}Test 2: Mixed Workload Test{*}:
** {*}Scenario{*}: Simulate a *mixed workload* of reads, writes, and queries
over a 24-hour period. Include both small and large commit operations to test
how the system handles varied workloads.
** {*}Metrics{*}: Monitor query latency, write performance, and compaction
efficiency over time. Ensure that the system maintains consistent performance
across all operations.
** {*}Goal{*}: Validate that the LSM timeline can support mixed workloads
without becoming a bottleneck.
* {*}Test 3: Snapshot and Cleanup Stability Test{*}:
** {*}Scenario{*}: Run a test that involves frequently writing to the timeline
and verifying that snapshots are correctly maintained and cleaned up. Trigger
regular compaction and cleaning operations.
** {*}Metrics{*}: Verify that only the last N snapshot versions are kept, and
that old snapshots are properly cleaned up. Ensure that manifest files are
updated correctly and that no obsolete files remain in the system.
** {*}Goal{*}: Ensure that the system maintains the correct number of
snapshots and that cleanup runs efficiently without leaving stale data behind.
*2.4 Stress Test Monitoring*
* {*}Resource Monitoring{*}:
** Set up *resource monitoring* for CPU, memory, and disk space to track how
the system behaves under long-running operations.
** Use cloud monitoring tools (e.g., AWS CloudWatch, GCP Monitoring) to track
API usage and response times.
* {*}Failure Handling{*}:
** Simulate *failure scenarios* (e.g., crashing a writer) to test how the
system recovers. Verify that the timeline remains consistent and that recovery
operations (e.g., rollback) work as expected.
** Test for *concurrency issues* by simulating multiple concurrent writers to
ensure that timeline isolation is maintained.
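For the memory checks in these long-running tests, a crude heap sampler along these lines can run alongside the workload (the growth-ratio threshold is an assumption for this sketch; real tests should confirm suspected leaks with heap dumps and profiler data):

```java
import java.util.List;

// Illustrative heap sampler for long-running stress tests: periodically
// record used heap and flag sustained growth as a possible leak signal.
class HeapSampler {
    // Used heap = total allocated heap minus the free portion.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Returns true if the last sample exceeds the first by more than `ratio`
    // (e.g., 0.5 = 50% growth). A crude heuristic, not a proof of a leak.
    static boolean looksLikeLeak(List<Long> samples, double ratio) {
        if (samples.size() < 2) return false;
        return samples.get(samples.size() - 1) > samples.get(0) * (1 + ratio);
    }
}
```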
----
h4. *3. Maintenance & Reliability Testing*
*3.1 Maintenance Test Overview*
* {*}Goal{*}: Ensure that the LSM timeline is
well-maintained over long periods, with no resource exhaustion or memory leaks.
Validate that compaction, cleaning, and manifest updates run at appropriate
intervals and do not cause performance issues.
*3.2 Maintenance Tests*
* {*}Test 1: Memory Leak Detection{*}:
** {*}Scenario{*}: Run a *long-running test* (24 hours) with continuous writes
and compactions to verify if the system leaks memory over time.
** {*}Metrics{*}: Track memory usage throughout the test to detect any signs
of memory growth. Ensure that the system's memory footprint remains stable.
** {*}Goal{*}: Verify that the system does not leak memory and remains stable
over long durations.
* {*}Test 2: Compaction & Cleanup Efficiency{*}:
** {*}Scenario{*}: Ensure that compaction and cleanup operations run
efficiently under sustained load. Test the system’s ability to compact small
files and clean up old files while maintaining performance.
** {*}Metrics{*}: Monitor the time taken for compaction and cleaning
operations. Ensure that these processes do not cause performance degradation
during high load.
** {*}Goal{*}: Validate that compaction and cleaning are efficient and that
the system remains performant over time.
* {*}Test 3: Manifest & Version File Consistency{*}:
** {*}Scenario{*}: Run long-term tests that involve frequent updates to the
manifest and version files to ensure their consistency.
** {*}Metrics{*}: Track the correctness of manifest updates and verify that
version files are updated correctly after compaction and cleaning.
** {*}Goal{*}: Ensure that manifest files and versioning remain consistent
across long-running operations.
> Harden, Stress and Performance test the LSM timeline on cloud storage
> ---------------------------------------------------------------------
>
> Key: HUDI-7544
> URL: https://issues.apache.org/jira/browse/HUDI-7544
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Vinoth Chandar
> Assignee: Sagar Sumit
> Priority: Blocker
> Fix For: 1.0.0
>
> Original Estimate: 16h
> Time Spent: 3h
> Remaining Estimate: 13h
>
> First, we need to summarize the access patterns to the LSM timeline:
> * Who reads/writes from/to it, and at what frequency (i.e., once per query,
> once per table service x, or multiple times in a commit, etc.)
> * Understand the defaults that control performance (e.g., the completion time
> query view loading the last 7 days of the LSM timeline, or something similar)
> * Flag any issues that can cause correctness problems for writes/queries,
> based on the optimizations done/design.
> * Finally, with the same/updated benchmark, run a large LSM timeline and
> ensure it is performant and efficient (in terms of cloud API calls).
> * Ensure the LSM timeline is well-maintained (compaction, etc. runs at the
> right frequency) with a long-running test, and ensure it does not leak memory.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)