JunRuiLee opened a new pull request, #8062:
URL: https://github.com/apache/paimon/pull/8062
## Motivation
COPY INTO previously only supported `ON_ERROR = ABORT_STATEMENT`: any parse or
cast error aborted the entire command. In production data-loading pipelines a
single malformed row or file would then fail the whole batch, which is often
too strict. This adds two error-tolerant modes:
- `CONTINUE`: skip bad rows and load the rest (row-level tolerance).
- `SKIP_FILE`: skip any file that contains an error (all-or-nothing per file).
`ABORT_STATEMENT` remains the default, so existing behavior is unchanged.
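
The difference between the three modes can be sketched as a small Python model. This is only an illustration of the semantics described above, not the PR's implementation: `copy_into`, `parse_int_row`, and the in-memory `files` mapping are hypothetical stand-ins, and "parse or cast error" is modeled as a `ValueError` from `int()`.

```python
from typing import Dict, List

VALID_MODES = ("ABORT_STATEMENT", "CONTINUE", "SKIP_FILE")


def parse_int_row(raw: str) -> int:
    """Hypothetical stand-in for parse + cast; raises ValueError on bad input."""
    return int(raw)


def copy_into(files: Dict[str, List[str]],
              on_error: str = "ABORT_STATEMENT") -> List[int]:
    """Return the rows that would be loaded under the given ON_ERROR mode."""
    if on_error not in VALID_MODES:
        raise ValueError(f"unsupported ON_ERROR mode: {on_error}")
    loaded: List[int] = []
    for _name, rows in files.items():
        good: List[int] = []
        file_has_error = False
        for raw in rows:
            try:
                good.append(parse_int_row(raw))
            except ValueError:
                if on_error == "ABORT_STATEMENT":
                    raise  # the whole command fails, nothing is returned
                file_has_error = True  # tolerant modes: remember and move on
        if on_error == "SKIP_FILE" and file_has_error:
            continue  # all-or-nothing per file: drop even the good rows
        loaded.extend(good)  # CONTINUE keeps the good rows of a dirty file
    return loaded


files = {"a.csv": ["1", "2"], "b.csv": ["3", "x", "5"]}
print(copy_into(files, "CONTINUE"))   # rows from both files, minus the bad one
print(copy_into(files, "SKIP_FILE"))  # rows from a.csv only
```

With the default mode, the same input raises on the malformed row in `b.csv` and nothing is loaded.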
## Changes
- Grammar: `ON_ERROR` now accepts `CONTINUE` and `SKIP_FILE` in addition to
`ABORT_STATEMENT`.
- Result schema gains two columns:
  - `errors_seen` (BIGINT): number of error rows per file.
  - `first_error` (STRING): first error message, NULL when the file is clean.
- `status` now also reports `PARTIALLY_LOADED` and `LOAD_FAILED`.
- Error detection runs once per batch; both modes write in a single commit.
Load history is recorded so error-tolerant runs stay idempotent under
`FORCE = FALSE`.
- Refactor: `CopyIntoTableExec` is split into focused helpers
(`CopyIntoHelper`, `CopyIntoCastValidator`, `CopyIntoDataFrameBuilder`,
`CopyIntoErrorHandler`, `CopyIntoResultBuilder`), shared across
CSV/JSON/Parquet.
- Docs updated in `sql-write.md`, including the CSV column-count-mismatch caveat
under `CONTINUE`.
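
To make the result-schema changes above concrete, here is a hedged Python sketch of how one per-file result row might be derived. Only `errors_seen`, `first_error`, `PARTIALLY_LOADED`, and `LOAD_FAILED` come from this description; the `file` and `rows_loaded` fields, the `LOADED` status for clean files, and the exact mapping from mode to status are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileResult:
    file: str                   # assumed column name
    status: str
    rows_loaded: int            # assumed column name
    errors_seen: int            # number of error rows in this file
    first_error: Optional[str]  # None models SQL NULL for a clean file


def summarize(file: str, good_rows: int, errors: List[str],
              on_error: str) -> FileResult:
    """Build one result row from per-file counts (assumed status mapping)."""
    errors_seen = len(errors)
    first_error = errors[0] if errors else None
    if errors_seen == 0:
        status, loaded = "LOADED", good_rows            # assumed clean status
    elif on_error == "CONTINUE" and good_rows > 0:
        status, loaded = "PARTIALLY_LOADED", good_rows  # bad rows skipped
    else:
        status, loaded = "LOAD_FAILED", 0               # e.g. file skipped
    return FileResult(file, status, loaded, errors_seen, first_error)


print(summarize("a.csv", 2, [], "CONTINUE"))
print(summarize("b.csv", 2, ["cannot cast 'x' to INT"], "CONTINUE"))
print(summarize("b.csv", 2, ["cannot cast 'x' to INT"], "SKIP_FILE"))
```

The error message text is invented for the example; the real messages depend on the reader and cast validator.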
Supported for CSV, JSON, and Parquet.
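
The load-history bullet above can be illustrated with a minimal Python sketch of idempotent re-runs under `FORCE = FALSE`. It assumes history is keyed by file path alone and that every processed file is recorded, including files that had errors; the description does not spell out either detail, and `plan_load` and `run_copy` are hypothetical names.

```python
from typing import List, Set


def plan_load(candidates: List[str], history: Set[str],
              force: bool = False) -> List[str]:
    """Pick the files to process: all of them if force, else only unseen ones."""
    return [f for f in candidates if force or f not in history]


def run_copy(candidates: List[str], history: Set[str],
             force: bool = False) -> List[str]:
    """Process the planned files, then record them in the load history."""
    to_load = plan_load(candidates, history, force)
    # ... parse, validate, and commit the files here ...
    history.update(to_load)  # assumption: recorded even if rows were skipped
    return to_load


history: Set[str] = set()
print(run_copy(["a.csv", "b.csv"], history))           # first run: both files
print(run_copy(["a.csv", "b.csv", "c.csv"], history))  # re-run: only c.csv
print(run_copy(["a.csv"], history, force=True))        # FORCE reloads a.csv
```

Recording a file after a tolerant run is what keeps a repeated command from loading its good rows a second time.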