SteNicholas opened a new pull request, #7850: URL: https://github.com/apache/paimon/pull/7850
### Purpose

**Problem**: For tables with BLOB columns, pypaimon uses `DataBlobWriter`, which splits each `pyarrow.RecordBatch` into “normal” columns (written to Parquet/ORC/…) and blob-file columns (written via `BlobWriter`). `_split_data` used the full-table lists of normal and blob-file column names when calling `RecordBatch.select(...)`.

**Regression**: When `TableWrite.with_write_type(...)` narrows the write to a partial column list, validation ensures incoming batches contain only those columns. `_split_data` still tried to select columns not present in the batch (e.g. a normal column omitted from the partial write), which caused PyArrow to raise `KeyError`.

**Fix**:

- Pass `write_cols` from `FileStoreWrite` into `DataBlobWriter` (same as for `AppendOnlyDataWriter`), so the blob writer sees the narrowed column set from `with_write_type`.
- In `DataBlobWriter.__init__`, derive `normal_column_names` and `blob_file_column_names` from that subset when `write_cols` is set: only blob-file columns that appear in `write_cols`, and normal columns = `write_cols` minus blob-file columns (order preserved from `write_cols`). Only instantiate `BlobWriter` for blob-file columns in that narrowed set.
- Keep full-table behavior when `write_cols` is `None` (full schema write).
- This keeps `_split_data` consistent with the actual batch schema and matches the intent of partial / data-evolution writes.

### Tests

Added in `paimon-python/pypaimon/tests/blob_table_test.py` (`DataBlobWriterTest`):

1. Partial normal + one blob: `with_write_type(['id', 'blob_data'])` with a batch that only has those columns; asserts one Parquet file (`write_cols == ['id']`), blob file(s) (`write_cols == ['blob_data']`), row counts, and commit + read-back (the unwritten `name` column is null).
2. Partial normal only: `with_write_type(['id', 'name'])` without blob columns in the batch; asserts no `.blob` files, Parquet `write_cols == ['id', 'name']`, and read-back with the blob column null.
3. Two blob columns, write one: schema with `blob1` and `blob2`, `with_write_type(['id', 'blob1'])`; asserts exactly one blob file and `write_cols == ['blob1']`.
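
The column derivation described under **Fix** can be sketched in plain Python as follows. This is only an illustration of the rule, not the code from the PR: the helper name `derive_column_names` and its parameters are hypothetical, and the ordering of the narrowed blob-file columns (taken here from the table's blob column order) is an assumption, since the description only fixes the order of the normal columns.

```python
from typing import List, Optional, Sequence, Tuple


def derive_column_names(
    table_columns: Sequence[str],
    blob_file_columns: Sequence[str],
    write_cols: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (normal_column_names, blob_file_column_names)."""
    blob_set = set(blob_file_columns)

    if write_cols is None:
        # Full schema write: keep the full-table lists.
        normal = [c for c in table_columns if c not in blob_set]
        return normal, list(blob_file_columns)

    write_set = set(write_cols)
    # Normal columns = write_cols minus blob-file columns, order from write_cols.
    normal = [c for c in write_cols if c not in blob_set]
    # Only blob-file columns that appear in write_cols (table order: an assumption).
    blobs = [c for c in blob_file_columns if c in write_set]
    return normal, blobs


# Mirrors the three test scenarios listed above.
schema = ["id", "name", "blob_data"]
print(derive_column_names(schema, ["blob_data"], ["id", "blob_data"]))
print(derive_column_names(schema, ["blob_data"], ["id", "name"]))
print(derive_column_names(["id", "blob1", "blob2"], ["blob1", "blob2"], ["id", "blob1"]))
```

With lists derived this way, every name later passed to `RecordBatch.select(...)` is guaranteed to be in `write_cols`, so a batch that passed the partial-write validation cannot trigger a lookup for a missing column.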
