[I] [spark] Enhance COPY INTO with multi-format support and advanced options [paimon]

via GitHub Wed, 27 May 2026 20:42:22 -0700


JunRuiLee opened a new issue, #8005:
URL: https://github.com/apache/paimon/issues/8005


   ## Motivation
   
   In production environments, bulk data import and export are common and 
critical operations: loading datasets from external storage into Paimon tables, 
or exporting table data to files for downstream consumption. The `COPY INTO` 
statement provides a declarative SQL interface for these operations, so users 
do not need to write custom ETL pipelines.
   
   The initial `COPY INTO` implementation was recently introduced with basic 
CSV support. This umbrella issue tracks the effort to extend it with 
multi-format support and advanced options, aligning with the commonly adopted 
capabilities in the industry.
   
   ## Planned Features
   
   We plan to implement the following seven capabilities, each as a separate 
PR:
   
   ### Format Support
   - [ ] **JSON format** — Support `FILE_FORMAT = (TYPE = JSON)` for both 
import and export. Options: `MULTI_LINE`, `COMPRESSION`, `NULL_IF`, 
`EMPTY_FIELD_AS_NULL`. JSON import uses column-name matching by default (no 
positional dependency).
   - [ ] **Parquet format** — Support `FILE_FORMAT = (TYPE = PARQUET)` for both 
import and export. Leverages Spark's native schema inference. Options: 
`COMPRESSION`.
   
   ### Error Handling
   - [ ] **ON_ERROR = CONTINUE / SKIP_FILE** — Currently only `ABORT_STATEMENT` 
is supported. `CONTINUE` skips bad rows and reports error counts per file. 
`SKIP_FILE` skips entire files that fail and reports per-file 
LOADED/LOAD_FAILED status.
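
   The three `ON_ERROR` modes can be modelled in a few lines of plain Python to make the intended semantics concrete. This is only an illustrative sketch, not the Spark implementation: `copy_into`, `FileResult`, and the in-memory `files` mapping are hypothetical names, a `ValueError` from the row parser stands in for a "bad row", and having `SKIP_FILE` give up on a file at its first bad row is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FileResult:
    """Per-file outcome. Field names are illustrative, not Paimon's."""
    status: str        # "LOADED" or "LOAD_FAILED"
    rows_loaded: int
    errors_seen: int


def copy_into(
    files: Dict[str, List[str]],
    parse_row: Callable[[str], object],
    on_error: str = "ABORT_STATEMENT",
) -> Tuple[List[object], Dict[str, FileResult]]:
    """Load rows from `files`, handling bad rows according to `on_error`."""
    if on_error not in ("ABORT_STATEMENT", "CONTINUE", "SKIP_FILE"):
        raise ValueError(f"unknown ON_ERROR mode: {on_error}")

    loaded: List[object] = []
    report: Dict[str, FileResult] = {}
    for name, raw_rows in files.items():
        good: List[object] = []
        errors = 0
        for raw in raw_rows:
            try:
                good.append(parse_row(raw))
            except ValueError:
                if on_error == "ABORT_STATEMENT":
                    # Fail the whole statement; nothing is returned as loaded.
                    raise
                errors += 1
                if on_error == "SKIP_FILE":
                    # Assumption: the first bad row disqualifies the file.
                    break
        if on_error == "SKIP_FILE" and errors:
            # Discard every row of the failing file.
            report[name] = FileResult("LOAD_FAILED", 0, errors)
            continue
        # CONTINUE keeps the good rows and records the error count.
        loaded.extend(good)
        report[name] = FileResult("LOADED", len(good), errors)
    return loaded, report
```

   With `files = {"a.csv": ["1", "x", "3"], "b.csv": ["4", "5"]}` and `parse_row=int`, `CONTINUE` keeps the good rows of both files and records one error for `a.csv`, while `SKIP_FILE` loads only `b.csv` and marks `a.csv` as `LOAD_FAILED`.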
   
   ### Column Matching
   - [ ] **MATCH_BY_COLUMN_NAME** — Support `NONE` (default, positional), 
`CASE_SENSITIVE`, and `CASE_INSENSITIVE` modes. Only applicable to structured 
formats (JSON/Parquet), not CSV. Mutually exclusive with explicit column lists.
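
   The matching rules above can be sketched as a small column resolver. This is a hedged illustration rather than the planned implementation: `resolve_columns` and its parameters are hypothetical, mapping a table column with no counterpart in the file to `None` is an assumption (the issue does not say whether such columns become NULL or raise an error), and rejecting file columns that collide under case-insensitive comparison is also an assumption.

```python
from typing import List, Optional


def resolve_columns(
    table_cols: List[str],
    file_cols: List[str],
    mode: str = "NONE",
    file_format: str = "JSON",
    has_column_list: bool = False,
) -> List[Optional[int]]:
    """For each table column, return the index of the file column feeding it."""
    if mode == "NONE":
        # Positional: table column i reads file column i, if it exists.
        return [i if i < len(file_cols) else None for i in range(len(table_cols))]

    if mode not in ("CASE_SENSITIVE", "CASE_INSENSITIVE"):
        raise ValueError(f"unknown MATCH_BY_COLUMN_NAME mode: {mode}")
    if file_format.upper() == "CSV":
        raise ValueError("MATCH_BY_COLUMN_NAME applies only to JSON/Parquet")
    if has_column_list:
        raise ValueError(
            "MATCH_BY_COLUMN_NAME cannot be combined with an explicit column list"
        )

    def normalize(name: str) -> str:
        return name.lower() if mode == "CASE_INSENSITIVE" else name

    index = {}
    for i, name in enumerate(file_cols):
        key = normalize(name)
        if key in index:
            # Assumption: ambiguous file columns are an error.
            raise ValueError(f"ambiguous file column: {name}")
        index[key] = i
    return [index.get(normalize(col)) for col in table_cols]
```

   For a table `(id, name)` and a file with columns `NAME, id`, `CASE_INSENSITIVE` yields `[1, 0]`, whereas `CASE_SENSITIVE` yields `[1, None]` because `name` and `NAME` differ in case.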
   
   ### Export Enhancement
   - [ ] **FROM (SELECT ...) query export** — Currently export only supports 
`FROM table_name`. This adds support for arbitrary SQL queries as the data 
source for `COPY INTO <location>`.
   
   ### Load Management
   - [ ] **PURGE** — When `PURGE = TRUE`, automatically delete source files 
after successful loading. Best-effort deletion (failures are silently ignored). 
In `SKIP_FILE` mode, only successfully loaded files are purged.
   - [ ] **VALIDATION_MODE** — Validate data without actually loading it. 
Supports `RETURN_<n>_ROWS` (preview n rows), `RETURN_ERRORS` (show first error 
per file), and `RETURN_ALL_ERRORS` (show all errors across all files).
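
   Two of the load-management details above lend themselves to a short sketch: parsing the `RETURN_<n>_ROWS` family of `VALIDATION_MODE` values, and best-effort purging of only the files that loaded successfully. The helper names `parse_validation_mode` and `purge_loaded` are hypothetical, the sketch deletes local files with `os.remove` purely for illustration (a real implementation would go through the table's file system abstraction), and requiring `n` to be positive is an assumption.

```python
import os
import re
from typing import Dict, List, Optional, Tuple

_RETURN_ROWS = re.compile(r"RETURN_(\d+)_ROWS")


def parse_validation_mode(value: str) -> Tuple[str, Optional[int]]:
    """Return (kind, n); n is set only for the RETURN_<n>_ROWS form."""
    if value in ("RETURN_ERRORS", "RETURN_ALL_ERRORS"):
        return value, None
    match = _RETURN_ROWS.fullmatch(value)
    if match and int(match.group(1)) > 0:  # assumption: n must be positive
        return "RETURN_ROWS", int(match.group(1))
    raise ValueError(f"invalid VALIDATION_MODE: {value}")


def purge_loaded(status_by_path: Dict[str, str]) -> List[str]:
    """Delete files whose status is LOADED; return the paths actually removed."""
    purged: List[str] = []
    for path, status in status_by_path.items():
        if status != "LOADED":
            # In SKIP_FILE mode, files that failed to load are left in place.
            continue
        try:
            os.remove(path)
            purged.append(path)
        except OSError:
            # Best-effort: a failed deletion is silently ignored.
            pass
    return purged
```

   For example, `parse_validation_mode("RETURN_10_ROWS")` returns `("RETURN_ROWS", 10)`, and `purge_loaded` skips any path whose status is `LOAD_FAILED` or whose deletion raises an `OSError`.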
   
   ## Proposed SQL Syntax
   
   ### Import
   ```sql
   COPY INTO table_name [(col1, col2, ...)]
   FROM 'source_path'
   FILE_FORMAT = (TYPE = CSV | JSON | PARQUET [, option = value, ...])
   [PATTERN = 'regex']
   [FORCE = TRUE | FALSE]
   [ON_ERROR = ABORT_STATEMENT | CONTINUE | SKIP_FILE]
   [MATCH_BY_COLUMN_NAME = NONE | CASE_SENSITIVE | CASE_INSENSITIVE]
   [PURGE = TRUE | FALSE]
   [VALIDATION_MODE = RETURN_<n>_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS]
   ```
   
   ### Export
   ```sql
   COPY INTO 'target_path'
   FROM table_name | ('SELECT ...')
   FILE_FORMAT = (TYPE = CSV | JSON | PARQUET [, option = value, ...])
   [OVERWRITE = TRUE | FALSE]
   ```


