dybyte opened a new pull request, #10399: URL: https://github.com/apache/seatunnel/pull/10399
Fixes: https://github.com/apache/seatunnel/issues/10329

### Purpose of this pull request

This PR introduces LSM-style write and compaction logic for IMap external storage, ensuring sorted immutable files, automatic recovery from temporary files, and size-based compaction.

<img width="623" height="600" alt="screenshot" src="https://github.com/user-attachments/assets/58c30bff-0b67-4b96-835a-d7bb8bdc479b" />

### Write & Compaction Workflow

1. During `write`, data is first appended to a temporary file in an unsorted format.
2. When the size of the temporary file exceeds `blockSize`, a `sortFlush` is triggered:
   - Records in the temporary file are sorted by key.
   - The sorted data is written into a new immutable data file.
   - The temporary file is then deleted.
3. The system tracks `totalBytes` of persisted files. When `totalBytes` exceeds `compactionThreshold`, compaction is triggered.
4. During compaction, smaller files are merged first to reduce file fragmentation, and only the latest record per key is retained.
5. On `initialize`, if any temporary files are detected (for example, due to an unexpected shutdown), `sortFlush` is executed to ensure:
   - All remaining temporary data is flushed into sorted files.
   - No temporary files are left behind.

   As a result, only sorted immutable files remain before normal operation resumes.

```
Write
  ↓
Temp File
  ↓ (blockSize)
SortFlush
  ↓
Data File
  ↓ (totalBytes > threshold)
Compaction
```

### Does this PR introduce _any_ user-facing change?

Yes. Please refer to docs.

### How was this patch tested?

- `LSMWriterTest` (parameterized for both Cloud and HDFS LSMWriters)
- `IMapFileStorageTest`

### Check list

* [ ] If your PR adds any new Jar binary package, please add a License Notice according to the [New License Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
* [x] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
* [ ] If necessary, please update `incompatible-changes.md` to describe the incompatibility caused by this PR.
* [ ] If you are contributing connector code, please check that the following files are updated:
  1. Update [plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties) and add the new connector information to it
  2. Update the pom file of [seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
  3. Add a CI label in [label-scope-conf](https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml)
  4. Add an e2e test case in [seatunnel-e2e](https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/)
  5. Update the connector [plugin_config](https://github.com/apache/seatunnel/blob/dev/config/plugin_config)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
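
For readers who want a concrete picture of the write and compaction workflow described in this pull request, here is a minimal in-memory sketch. It is not the code from the PR: the class `LsmSketch`, the `Rec` type, the per-record sequence number, and the choice to merge every data file into one during compaction are all assumptions made for illustration. Only the names `write`, `sortFlush`, `blockSize`, `totalBytes`, and `compactionThreshold` come from the description. Lists stand in for the temporary file and the immutable data files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** In-memory sketch of write -> sortFlush -> compaction (hypothetical, not the PR's code). */
class LsmSketch {
    /** One record; seq is an assumed write counter that defines "latest". */
    static final class Rec {
        final String key;
        final String value;
        final long seq;
        Rec(String key, String value, long seq) {
            this.key = key;
            this.value = value;
            this.seq = seq;
        }
        long bytes() { return key.length() + value.length(); }
    }

    private final long blockSize;            // flush threshold for the temp buffer
    private final long compactionThreshold;  // threshold on persisted bytes
    private final List<Rec> temp = new ArrayList<>();             // unsorted "temp file"
    private final List<List<Rec>> dataFiles = new ArrayList<>();  // sorted immutable "files"
    private long tempBytes = 0;
    private long totalBytes = 0;
    private long nextSeq = 0;

    LsmSketch(long blockSize, long compactionThreshold) {
        this.blockSize = blockSize;
        this.compactionThreshold = compactionThreshold;
    }

    /** Step 1: append unsorted; step 2: flush once the temp data exceeds blockSize. */
    void write(String key, String value) {
        Rec rec = new Rec(key, value, nextSeq++);
        temp.add(rec);
        tempBytes += rec.bytes();
        if (tempBytes > blockSize) {
            sortFlush();
        }
    }

    /** Sorts the temp data by key, stores it as a new immutable file, clears the temp data. */
    void sortFlush() {
        if (temp.isEmpty()) {
            return;
        }
        List<Rec> sorted = new ArrayList<>(temp);
        sorted.sort(Comparator.comparing((Rec r) -> r.key).thenComparingLong(r -> r.seq));
        dataFiles.add(Collections.unmodifiableList(sorted));
        totalBytes += tempBytes;
        temp.clear();
        tempBytes = 0;
        if (totalBytes > compactionThreshold) {  // step 3
            compact();
        }
    }

    /** Step 4: visit smaller files first and keep only the highest-seq record per key. */
    void compact() {
        List<List<Rec>> bySize = new ArrayList<>(dataFiles);
        bySize.sort(Comparator.comparingLong(LsmSketch::sizeOf));
        TreeMap<String, Rec> merged = new TreeMap<>();
        for (List<Rec> file : bySize) {
            for (Rec rec : file) {
                merged.merge(rec.key, rec, (a, b) -> a.seq >= b.seq ? a : b);
            }
        }
        List<Rec> compacted = Collections.unmodifiableList(new ArrayList<>(merged.values()));
        dataFiles.clear();
        dataFiles.add(compacted);
        totalBytes = sizeOf(compacted);
    }

    static long sizeOf(List<Rec> file) {
        long sum = 0;
        for (Rec rec : file) {
            sum += rec.bytes();
        }
        return sum;
    }

    int fileCount() { return dataFiles.size(); }
    int pendingRecords() { return temp.size(); }
    long totalBytes() { return totalBytes; }

    /** Returns the latest value written for key, or null if the key is absent. */
    String get(String key) {
        Rec best = null;
        List<List<Rec>> all = new ArrayList<>(dataFiles);
        all.add(temp);
        for (List<Rec> file : all) {
            for (Rec rec : file) {
                if (rec.key.equals(key) && (best == null || rec.seq > best.seq)) {
                    best = rec;
                }
            }
        }
        return best == null ? null : best.value;
    }
}
```

In this sketch, calling `sortFlush()` once at startup plays the role of the recovery step on `initialize`: any leftover temporary data becomes a sorted file before further writes. Merging all files in one pass is the simplest policy that respects "smaller files first"; a real implementation may well select only a subset of files per compaction.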
