[PR] [spark] Support Parquet format in COPY INTO [paimon]

via GitHub Fri, 29 May 2026 18:33:34 -0700


JunRuiLee opened a new pull request, #8037:
URL: https://github.com/apache/paimon/pull/8037


   This PR adds Parquet format support for COPY INTO import and export, as part of #8005.
   
   ## Changes
   
   **Import** (`COPY INTO table FROM path`):
   - Read Parquet files with their native typed schema (no string-then-cast step as with CSV/JSON)
   - Column matching by name (case-insensitive), not by position
   - Extra source columns are ignored; missing columns become NULL
   - Cast validation: detects values that are non-null before casting but NULL afterwards (a sign of type incompatibility)
   - Supports explicit column list, PATTERN, FORCE, ON_ERROR = ABORT_STATEMENT
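
The import rules above (case-insensitive name matching, ignored extra columns, NULL for missing columns, and the non-null → null cast check) can be sketched in plain Python. This is only an illustration of the described behavior, not the PR's Spark implementation: `align_row`, `safe_cast`, and `find_lossy_casts` are hypothetical names, rows are modeled as dicts, and a failed cast is assumed to yield `None` rather than raise.

```python
from typing import Any, Callable, Dict, List, Tuple


def align_row(source_row: Dict[str, Any], target_columns: List[str]) -> Dict[str, Any]:
    """Map a source record onto the target columns by case-insensitive name."""
    # Index source fields by lowercased name.
    # Assumption: no two source field names differ only by case.
    by_lower = {name.lower(): value for name, value in source_row.items()}
    # Missing target columns become None (NULL); extra source fields are
    # simply never looked up, so they are ignored.
    return {col: by_lower.get(col.lower()) for col in target_columns}


def safe_cast(value: Any, caster: Callable[[Any], Any]) -> Any:
    """Apply caster, returning None on failure (assumed NULL-on-failure cast)."""
    if value is None:
        return None
    try:
        return caster(value)
    except (TypeError, ValueError):
        return None


def find_lossy_casts(
    rows: List[Dict[str, Any]], casters: Dict[str, Callable[[Any], Any]]
) -> List[Tuple[int, str]]:
    """Return (row_index, column) pairs where a non-null value became None after casting."""
    bad = []
    for i, row in enumerate(rows):
        for col, caster in casters.items():
            before = row.get(col)
            # A value that was present but is None after the cast signals
            # a type incompatibility; an original None is not an error.
            if before is not None and safe_cast(before, caster) is None:
                bad.append((i, col))
    return bad
```

For example, `align_row({"ID": 1, "extra": True}, ["id", "age"])` keeps `id`, drops `extra`, and fills `age` with `None`, while `find_lossy_casts` flags only rows whose non-null values fail to cast.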
   
   **Export** (`COPY INTO path FROM table`):
   - Write Parquet files via `df.write.parquet()`
   - COMPRESSION option (SNAPPY, GZIP, NONE, etc.)
   
   **Refactoring**:
   - Extract `resolveDefaultColumn()` shared helper (was duplicated in Parquet and text paths)
   - Unify `recordHistoryAndBuildResults()` to accept a `countDf` parameter (eliminates ~45 lines of copy-paste between Parquet and text paths)
   - Add `logWarning` when default value expression parsing fails (was silently swallowed)
   
   ## Tests
   
   12 new tests covering: basic import, column name matching, explicit column list, export, export with compression, round-trip, extra fields ignored, missing fields become null, FORCE=FALSE dedup, PATTERN filtering, unsupported option error, rows_loaded count accuracy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
