liujiwen-up opened a new pull request, #660:
URL: https://github.com/apache/doris-flink-connector/pull/660

   # Proposed changes
   
   Issue Number: close #xxx
   
   ## Problem Summary:
   
   This PR changes Doris Flink Connector `INSERT OVERWRITE` sink behavior to 
avoid truncating the target table before data is successfully written.
   
   Previously, `INSERT OVERWRITE` executed `TRUNCATE TABLE` before writing. If 
the Flink job failed after truncate but before the write completed, the target 
table could be left empty.
   
   The new implementation uses a staging-table based flow:
   
   1. Create a staging table with `CREATE TABLE staging LIKE target`.
   2. Write data into the staging table through Stream Load 2PC.
   3. After all committed data is available, finalize the overwrite with:
      `ALTER TABLE target REPLACE WITH TABLE staging 
PROPERTIES('swap'='false')`.
   
   This implementation is currently limited to bounded `INSERT OVERWRITE` with 
`STREAM_LOAD` and 2PC enabled. It rejects unsafe configurations such as 
streaming overwrite, non-Stream Load write modes, 
`sink.ignore.commit-error=true`, missing `sink.label-prefix`, missing 
`jdbc-url`, and pre-existing staging tables.
   
   Additional guards were added to require Doris table id metadata from 
`information_schema.metadata_name_ids`, so the connector can detect 
target-table changes before finalization and identify already-finalized 
overwrite attempts.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: Yes
   2. Has unit tests been added: Yes
   3. Has document been added or modified: No Need
   4. Does it need to update dependencies: No
   5. Are there any changes that cannot be rolled back: No
   
   ## Further comments
   
   This change intentionally chooses a conservative first version:
   
   - Only bounded overwrite is supported.
   - Only Stream Load 2PC is supported.
   - Existing staging tables are not reused to avoid publishing stale or mixed 
data.
   - Table id metadata is required instead of silently degrading safety checks.
   
   Alternatives considered include continuing to use `TRUNCATE TABLE`, reusing 
existing staging tables during recovery, or supporting more write modes 
immediately. These were not chosen because they either preserve the original 
data-loss risk or make failure/recovery semantics harder to prove safe.
   
   Validation performed:
   
   ```bash
   mvn -Pflink1 -pl flink-doris-connector-base -Dtest=TestDorisOverwriteManager 
test
   mvn -Pflink1 -pl flink-doris-connector-flink1 -am -DskipTests compile
   mvn -Pflink2 -pl flink-doris-connector-flink2 -am -DskipTests compile
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to