yzeng1618 commented on PR #10266: URL: https://github.com/apache/seatunnel/pull/10266#issuecomment-3712983852
> > `DataSaveMode`
> >
> > +1, @yzeng1618, Could you explain the difference between `file_exists_mode`, `DataSaveMode`, and `SchemaSaveMode`? Does the combined functionality of `DataSaveMode` and `SchemaSaveMode` include `file_exists_mode`?
> >
> > `OVERWRITE` (default) / `SKIP` / `FAIL`
> >
> > Why should we do this in the submission phase? I think what `OVERWRITE (default) / SKIP / FAIL` does is consistent with the final behavior of `DataSaveMode`.

The reason `file_exists_mode` is applied in the commit phase (the 2PC rename/move of temporary files to the final path) is that only in this phase are the temporary files given their final filenames in the target directory. Name conflicts therefore occur at the file level, and a deterministic OVERWRITE/SKIP/FAIL decision can only be made during the rename operation.

Here is an example to illustrate.

Scenario setup:
- Source directory: /tmp/source contains test1.txt and test2.txt.
- Target directory: /tmp/target already has an existing test1.txt (old file), while test2.txt does not exist.
- Configuration: path=/tmp/target (the write destination directory). Writes are first persisted to tmp_path, then renamed to /tmp/target/... during commit.

1. Observing only data_save_mode (takes effect before the task starts, directory-level)
- DROP_DATA: Clear/recreate /tmp/target before task startup (the old test1.txt will be deleted) → No conflicts occur when writing test1.txt and test2.txt during commit → Result: The target directory contains the new test1.txt + test2.txt.
- APPEND_DATA: Do not modify /tmp/target before task startup (the old test1.txt remains) → The conflict "the test1.txt to be written already exists" will be encountered during commit, but APPEND_DATA itself does not define how to handle single-file name conflicts → The decision to overwrite/skip/fail depends on file_exists_mode.
- ERROR_WHEN_DATA_EXISTS: Check /tmp/target before task startup; fail if any data files exist (test1.txt currently does) → Fail directly without proceeding to the write/commit phase.

2. Observing only file_exists_mode (takes effect during commit, file-level; assuming data_save_mode=APPEND_DATA)
- OVERWRITE: When renaming test1.txt during commit and detecting the existing old file → Delete the old test1.txt first, then rename the temporary file into its place; test2.txt is committed normally → Result: The target directory contains the new test1.txt + test2.txt.
- SKIP: When detecting the existing test1.txt during commit → Retain the old test1.txt, delete the temporary test1.txt, and mark the commit as successful; test2.txt is committed normally → Result: The target directory contains the old test1.txt + new test2.txt.
- FAIL: When detecting the existing test1.txt during commit → Throw an error and fail immediately (used to explicitly prevent overwrites).
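
To make the file-level decision concrete, here is a rough sketch of the commit-time rename using plain `java.nio.file`. It is only an illustration under stated assumptions, not the code in this PR: the class `CommitSketch`, the enum `FileExistsMode`, and the method `commitFile` are hypothetical names, and a local filesystem stands in for the real file system abstraction.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class CommitSketch {
    // Hypothetical enum mirroring the three option values discussed above.
    enum FileExistsMode { OVERWRITE, SKIP, FAIL }

    /**
     * Moves one temporary file to its final path and resolves a name
     * conflict according to the given mode.
     * Returns true if the temporary file became the final file.
     */
    static boolean commitFile(Path tmpFile, Path finalFile, FileExistsMode mode)
            throws IOException {
        if (Files.exists(finalFile)) {
            switch (mode) {
                case SKIP:
                    // Keep the old file, discard the temporary one, report success.
                    Files.deleteIfExists(tmpFile);
                    return false;
                case FAIL:
                    // Refuse to touch the existing file.
                    throw new FileAlreadyExistsException(finalFile.toString());
                case OVERWRITE:
                    // Delete the old file first, then fall through to the rename.
                    Files.delete(finalFile);
                    break;
            }
        }
        Files.move(tmpFile, finalFile);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the scenario: the target already holds an old test1.txt.
        Path tmp = Files.createTempDirectory("tmp_path");
        Path target = Files.createTempDirectory("target");
        Files.writeString(target.resolve("test1.txt"), "old");
        Files.writeString(tmp.resolve("test1.txt"), "new");
        Files.writeString(tmp.resolve("test2.txt"), "new");

        for (String name : new String[] {"test1.txt", "test2.txt"}) {
            boolean written = commitFile(
                    tmp.resolve(name), target.resolve(name), FileExistsMode.SKIP);
            System.out.println(name + " written=" + written
                    + " content=" + Files.readString(target.resolve(name)));
        }
    }
}
```

Note that the existence check and the move are two separate steps in this sketch, so it is not safe against a concurrent writer; it only shows why the conflict can be decided per file at rename time and not before the task starts.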
