dybyte opened a new pull request, #10399:
URL: https://github.com/apache/seatunnel/pull/10399

   Fixes: https://github.com/apache/seatunnel/issues/10329
   ### Purpose of this pull request
   
   This PR introduces LSM-style write and compaction logic for IMap external 
storage, ensuring
   sorted immutable files, automatic recovery from temporary files, and 
size-based compaction.
   
   <img width="623" height="600" alt="screenshot" src="https://github.com/user-attachments/assets/58c30bff-0b67-4b96-835a-d7bb8bdc479b" />
   
   
   ### Write & Compaction Workflow
   
   1. During `write`, data is first appended to a temporary file in an unsorted 
format.
   2. When the size of the temporary file exceeds `blockSize`, a `sortFlush` is 
triggered:
       - Records in the temporary file are sorted by key.
       - The sorted data is written into a new immutable data file.
       - The temporary file is then deleted.
   3. The system tracks `totalBytes` of persisted files. When `totalBytes` 
exceeds `compactionThreshold`, compaction is triggered.
   4. During compaction, smaller files are merged first to reduce file 
fragmentation, and only the latest record per key is retained.
   5. On `initialize`, if any temporary files are detected (for example, due to 
an unexpected shutdown), `sortFlush` is executed to ensure:
       - All remaining temporary data is flushed into sorted files.
       - No temporary files are left behind.
   
   As a result, only sorted immutable files remain before normal operation 
resumes.
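
The write path in steps 1, 2, and 5 can be illustrated with a minimal in-memory sketch. This is not the code from this PR: the class `LsmWriteSketch`, its fields, and `pendingRecords` are hypothetical names, in-memory lists stand in for the temporary file and the data files, and record size is approximated as the UTF-8 length of key plus value. Only `blockSize`, `sortFlush`, and `initialize` are taken from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the write path; not the actual SeaTunnel writer. */
final class LsmWriteSketch {
    private final long blockSize;
    // Stand-in for the unsorted temporary file: records in arrival order.
    private final List<Map.Entry<String, String>> tempFile = new ArrayList<>();
    private long tempBytes = 0;
    // Stand-in for the sorted immutable data files, oldest first.
    final List<List<Map.Entry<String, String>>> dataFiles = new ArrayList<>();

    LsmWriteSketch(long blockSize) {
        this.blockSize = blockSize;
    }

    /** Appends a record unsorted, then flushes once the temp file exceeds blockSize. */
    void write(String key, String value) {
        tempFile.add(new SimpleImmutableEntry<>(key, value));
        tempBytes += key.getBytes(StandardCharsets.UTF_8).length
                + value.getBytes(StandardCharsets.UTF_8).length;
        if (tempBytes > blockSize) {
            sortFlush();
        }
    }

    /** Sorts the temp records by key, publishes them as a read-only file, clears the temp file. */
    void sortFlush() {
        if (tempFile.isEmpty()) {
            return;
        }
        List<Map.Entry<String, String>> sorted = new ArrayList<>(tempFile);
        // List.sort is stable, so repeated keys keep their write order.
        sorted.sort(Map.Entry.comparingByKey());
        dataFiles.add(Collections.unmodifiableList(sorted));
        tempFile.clear();
        tempBytes = 0;
    }

    /** On startup, flushes whatever an interrupted run left in the temp file. */
    void initialize() {
        sortFlush();
    }

    /** Number of records still waiting in the temp file. */
    int pendingRecords() {
        return tempFile.size();
    }
}
```

In this model, calling `initialize()` on an instance that still holds unflushed records plays the role of recovery after an unexpected shutdown: afterwards `pendingRecords()` is 0 and every record lives in a sorted, read-only list.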
   
   ```
   Write
     ↓
   Temp File
     ↓ (blockSize)
   SortFlush
     ↓
   Data File
     ↓ (totalBytes > threshold)
   Compaction
   
   ```
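
The compaction stage (steps 3 and 4) could look roughly like the following sketch. It rests on several assumptions that the description does not confirm: the class `CompactionSketch`, the `Rec` type, and the per-record `seq` number used to decide which record is "latest" are all hypothetical, files are modeled as in-memory lists, and the policy of merging the two smallest files until `totalBytes` drops to the threshold (or one file remains) is only one possible reading of "smaller files are merged first".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical model of size-triggered compaction; not the actual SeaTunnel code. */
final class CompactionSketch {
    /** One record; seq is an assumed write sequence number (higher means newer). */
    static final class Rec {
        final String key;
        final String value;
        final long seq;

        Rec(String key, String value, long seq) {
            this.key = key;
            this.value = value;
            this.seq = seq;
        }
    }

    /** Approximate size of one file: sum of key and value lengths. */
    static long fileBytes(List<Rec> file) {
        long bytes = 0;
        for (Rec r : file) {
            bytes += r.key.length() + r.value.length();
        }
        return bytes;
    }

    static long totalBytes(List<List<Rec>> files) {
        long bytes = 0;
        for (List<Rec> f : files) {
            bytes += fileBytes(f);
        }
        return bytes;
    }

    /** Merges two files, keeping only the newest record per key, sorted by key. */
    static List<Rec> merge(List<Rec> a, List<Rec> b) {
        TreeMap<String, Rec> latest = new TreeMap<>();
        for (List<Rec> file : List.of(a, b)) {
            for (Rec r : file) {
                Rec seen = latest.get(r.key);
                if (seen == null || r.seq > seen.seq) {
                    latest.put(r.key, r);
                }
            }
        }
        return new ArrayList<>(latest.values());
    }

    /** While totalBytes exceeds the threshold, merges the two smallest files. */
    static List<List<Rec>> compact(List<List<Rec>> files, long compactionThreshold) {
        List<List<Rec>> work = new ArrayList<>(files);
        while (totalBytes(work) > compactionThreshold && work.size() > 1) {
            work.sort(Comparator.comparingLong(CompactionSketch::fileBytes));
            List<Rec> merged = merge(work.remove(0), work.remove(0));
            work.add(merged);
        }
        return work;
    }
}
```

Tracking recency per record rather than per file keeps the merge correct even when the two smallest files are not adjacent in write order. Note that `List.of` requires Java 9 or later; on Java 8, `Arrays.asList(a, b)` would serve the same purpose.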
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Please refer to docs.
   
   ### How was this patch tested?
   
   - `LSMWriterTest` (parameterized for both Cloud and HDFS LSMWriters)
   - `IMapFileStorageTest`
   
   ### Check list
   
   * [ ] If any new Jar binary package is added in your PR, please add a License Notice according to the
     [New License Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [x] If necessary, please update the documentation to describe the new 
feature. https://github.com/apache/seatunnel/tree/dev/docs
   * [ ] If necessary, please update `incompatible-changes.md` to describe the 
incompatibility caused by this PR.
   * [ ] If you are contributing the connector code, please check that the 
following files are updated:
     1. Update 
[plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties)
 and add new connector information in it
     2. Update the pom file of 
[seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
     3. Add ci label in 
[label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
     4. Add e2e testcase in 
[seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
     5. Update connector 
[plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)


