[PR] [spark] Add COPY INTO support for CSV import and file writing [paimon]

via GitHub Thu, 21 May 2026 05:36:27 -0700


JunRuiLee opened a new pull request, #7926:
URL: https://github.com/apache/paimon/pull/7926


   ## What is changed
   
   This PR adds Spark SQL `COPY INTO` support for bulk CSV import and CSV file writing.
   
   Supported import syntax:
   
   ```sql
   COPY INTO table_name [(col1, col2, ...)]
   FROM 'source_path'
   FILE_FORMAT = (TYPE = CSV [, option = value, ...])
   [PATTERN = 'regex']
   [FORCE = TRUE|FALSE]
   [ON_ERROR = ABORT_STATEMENT]
   ```
   
   Supported file writing syntax:
   
   ```sql
   COPY INTO 'target_path'
   FROM table_name
   FILE_FORMAT = (TYPE = CSV [, option = value, ...])
   [OVERWRITE = TRUE|FALSE]
   ```
   
   ## Main features
   
   - Add parser, logical plans, and Spark execution for `COPY INTO`.
   - Support CSV import into Paimon tables.
   - Support CSV file writing from Paimon tables.
   - Support structured `FILE_FORMAT = (...)` options.
   - Support explicit import column lists with positional mapping.
   - Fill omitted columns with table default values or `NULL`.
   - Support `PATTERN` filtering by source file base name.
   - Support `FORCE` for controlling repeated imports.
   - Return observable command results for both import and file writing.
   - Add user documentation and Spark tests.
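
The positional column mapping and default filling listed above can be illustrated with a small Python sketch. It is not the PR's implementation: `map_row`, its parameters, and the sample schema are hypothetical names chosen for illustration, and rejecting rows whose field count differs from the column list is an assumption. The i-th CSV field goes to the i-th listed column, and every omitted table column receives its default value or `None` (standing in for `NULL`).

```python
from typing import Any, Dict, Optional, Sequence


def map_row(
    csv_fields: Sequence[Any],
    table_columns: Sequence[str],
    column_list: Optional[Sequence[str]] = None,
    defaults: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Map CSV fields to table columns by position (hypothetical helper)."""
    defaults = defaults or {}
    # Without an explicit column list, fields map onto all table columns in order.
    targets = list(column_list) if column_list else list(table_columns)

    unknown = [c for c in targets if c not in table_columns]
    if unknown:
        raise ValueError(f"unknown columns in column list: {unknown}")
    # Assumption: a field-count mismatch is treated as an error.
    if len(csv_fields) != len(targets):
        raise ValueError(
            f"expected {len(targets)} fields, got {len(csv_fields)}"
        )

    # Start every column at its default (or None), then overlay the mapped fields.
    row = {col: defaults.get(col) for col in table_columns}
    row.update(zip(targets, csv_fields))
    return row
```

For example, mapping the fields `["1", "alice"]` onto the column list `["id", "name"]` of a four-column table leaves the remaining two columns at their default or `None`.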
   
   ## CSV import options
   
   Supported import `FILE_FORMAT` options:
   
   | Option | Description |
   | --- | --- |
   | `TYPE = CSV` | CSV file format. |
   | `FIELD_DELIMITER` | Column delimiter character. |
   | `SKIP_HEADER` | Number of leading header lines to skip. Only `0` and `1` are supported. |
   | `QUOTE` | Quote character for enclosing fields. |
   | `ESCAPE` | Escape character within quoted fields. |
   | `NULL_IF` | Values to interpret as `NULL`. |
   | `EMPTY_FIELD_AS_NULL` | Treat empty fields as `NULL`. |
   | `COMPRESSION` | Compression codec. |
   
   `COPY INTO` reads CSV input with `FAILFAST` behavior. `ON_ERROR = ABORT_STATEMENT` is the only supported error handling mode.
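
As a rough illustration of how `NULL_IF` and `EMPTY_FIELD_AS_NULL` might interact on a single field, here is a minimal Python sketch. The function `normalize_field` and the order in which the two checks run are assumptions made for illustration, not behavior confirmed by this PR.

```python
from typing import Iterable, Optional


def normalize_field(
    value: str,
    null_if: Iterable[str] = (),
    empty_field_as_null: bool = False,
) -> Optional[str]:
    """Return None when the raw CSV field should be read as NULL (hypothetical helper)."""
    # EMPTY_FIELD_AS_NULL: an empty field becomes NULL only when the option is on.
    if empty_field_as_null and value == "":
        return None
    # NULL_IF: any listed literal is interpreted as NULL.
    if value in set(null_if):
        return None
    return value
```

With `null_if=["NA"]`, the field `"NA"` maps to `None` while `"x"` is kept; an empty string stays empty unless `empty_field_as_null` is set.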
   
   ## CSV file writing options
   
   Supported file writing `FILE_FORMAT` options:
   
   | Option | Description |
   | --- | --- |
   | `TYPE = CSV` | CSV file format. |
   | `FIELD_DELIMITER` | Column delimiter character. |
   | `HEADER` | Write column names as the first line. |
   | `QUOTE` | Quote character for enclosing fields. |
   | `ESCAPE` | Escape character within quoted fields. |
   | `COMPRESSION` | Compression codec. |
   
   `OVERWRITE = FALSE` fails if the target path already exists. `OVERWRITE = TRUE` overwrites the target path.
   
   ## Repeated imports
   
   For table imports, `COPY INTO` records each successfully loaded source file and, by default, skips it on later runs.
   
   A source file is identified by:
   
   | Field | Description |
   | --- | --- |
   | `file path` | Full source file path. |
   | `file size` | Source file size. |
   | `last modified timestamp` | Source file last-modified timestamp. |
   
   With `FORCE = FALSE`, already loaded files are skipped and returned with 
status `SKIPPED`.
   
   With `FORCE = TRUE`, matching source files are loaded again.
   
   The load history is written after the table write succeeds. This provides 
best-effort protection against duplicate imports, but it is not a strict 
exactly-once guarantee. If the table commit succeeds but writing load history 
fails, a later retry may load the same files again. Concurrent `COPY INTO` 
commands targeting the same files may also produce duplicate data.
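
The skip decision described above can be sketched in Python. This is a simplified model under stated assumptions, not the PR's code: `SourceFile` and `plan_import` are hypothetical names, the load history is modeled as an in-memory set, and timestamps are plain integers. A file counts as already loaded only when its path, size, and last-modified timestamp all match a history entry.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class SourceFile:
    """Identity of a source file: path, size, and last-modified timestamp."""
    path: str
    size: int
    last_modified: int


def plan_import(
    candidates: Iterable[SourceFile],
    history: Set[SourceFile],
    force: bool = False,
) -> List[Tuple[str, str]]:
    """Return (path, status) pairs, where status is LOADED or SKIPPED."""
    plan = []
    for f in candidates:
        # With FORCE = TRUE the history is ignored and every file is loaded again.
        if not force and f in history:
            plan.append((f.path, "SKIPPED"))
        else:
            plan.append((f.path, "LOADED"))
    return plan
```

Because the identity includes size and timestamp, a file that is rewritten at the same path is treated as new in this model and loaded again even with `force=False`.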
   
   ## Result output
   
   Import returns one row per source file:
   
   | Column | Type | Description |
   | --- | --- | --- |
   | `file_name` | `STRING` | Source file name. |
   | `status` | `STRING` | `LOADED` or `SKIPPED`. |
   | `rows_loaded` | `BIGINT` | Number of rows written. |
   | `rows_parsed` | `BIGINT` | Number of rows parsed from the file. |
   
   File writing returns one row:
   
   | Column | Type | Description |
   | --- | --- | --- |
   | `output_path` | `STRING` | Target output path. |
   | `file_count` | `INT` | Number of files written. |
   | `rows_written` | `BIGINT` | Total rows written. |
   
   ## Limitations
   
   - Only CSV format is supported.
   - File writing supports only `FROM table_name`; a query as the source is not supported.
   - `ON_ERROR = CONTINUE` is not supported.
   - `SINGLE = TRUE` is not supported.
   - File format options must be specified inline in `FILE_FORMAT = (...)`.
   - Import file listing is non-recursive.
   - `PATTERN` matches only the source file base name.
   - `SKIP_HEADER` supports only `0` or `1`.
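
To make the `PATTERN` limitation concrete, the following Python sketch filters candidate files by base name only. The helper `filter_by_pattern` is hypothetical, and requiring the regex to match the whole base name (rather than any substring) is an assumption; the PR description does not state which matching mode is used.

```python
import posixpath
import re
from typing import Iterable, List


def filter_by_pattern(paths: Iterable[str], pattern: str) -> List[str]:
    """Keep paths whose base name matches the regex (hypothetical helper)."""
    regex = re.compile(pattern)
    # Only the last path component is tested; directory names never participate.
    # Assumption: the pattern must match the entire base name.
    return [p for p in paths if regex.fullmatch(posixpath.basename(p))]
```

Under this model, a pattern such as `.*\.csv` selects `/data/in/a.csv`, while a pattern that refers to a directory name (for example `csv_dir.*`) selects nothing.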
   
   ## Tests
   
   Added Spark SQL tests for:
   
   - CSV import
   - CSV import options
   - explicit column mapping
   - default value filling
   - malformed CSV failures
   - cast failure handling
   - repeated import behavior with `FORCE`
   - CSV file writing
   - overwrite behavior
   - option validation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
