SteNicholas opened a new pull request, #7850:
URL: https://github.com/apache/paimon/pull/7850

   ### Purpose
   
   **Problem**: For tables with BLOB columns, pypaimon uses `DataBlobWriter`, 
which splits each `pyarrow.RecordBatch` into “normal” columns (written to 
Parquet/ORC/…) and blob-file columns (written via `BlobWriter`). `_split_data` 
always passed the table's full lists of normal and blob-file column names to 
`RecordBatch.select(...)`.
   
   **Regression**: When `TableWrite.with_write_type(...)` narrows the write to 
a partial column list, validation ensures incoming batches contain only those 
columns. `_split_data` still tried to select columns not present in the batch 
(e.g. a normal column omitted from the partial write), which caused PyArrow to 
raise `KeyError`.
   
   **Fix**:
   
   - Pass `write_cols` from `FileStoreWrite` into `DataBlobWriter` (same as for 
`AppendOnlyDataWriter`), so the blob writer sees the narrowed column set from 
`with_write_type`.
   - In `DataBlobWriter.__init__`, derive `normal_column_names` and 
`blob_file_column_names` from that subset when `write_cols` is set: only 
blob-file columns that appear in `write_cols`, and normal columns = `write_cols` 
minus blob-file columns (order preserved from `write_cols`). Only instantiate 
`BlobWriter` for blob-file columns in that narrowed set.
   - Keep full-table behavior when `write_cols` is `None` (full schema write).
   - This keeps `_split_data` consistent with the actual batch schema and 
matches the intent of partial / data-evolution writes.
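
The column derivation described in the list above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `split_column_names` and its parameters are hypothetical names, and ordering the narrowed blob-file list by `write_cols` is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def split_column_names(
    all_normal: Sequence[str],
    all_blob_file: Sequence[str],
    write_cols: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Return (normal_column_names, blob_file_column_names) for one write.

    Hypothetical helper: mirrors the rule described above, not pypaimon's API.
    """
    if write_cols is None:
        # Full schema write: keep the table-wide lists unchanged.
        return list(all_normal), list(all_blob_file)

    blob_set = set(all_blob_file)
    # Normal columns = write_cols minus blob-file columns, in write_cols order.
    normal = [c for c in write_cols if c not in blob_set]
    # Only blob-file columns that appear in write_cols
    # (assumption: also kept in write_cols order).
    blob = [c for c in write_cols if c in blob_set]
    return normal, blob
```

With lists derived this way, a later `select` on the batch only asks for columns the partial write actually contains, and a writer is needed only for each name in the second list.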
   
   ### Tests
   
   Added in `paimon-python/pypaimon/tests/blob_table_test.py` 
(`DataBlobWriterTest`):
   
   1. Partial normal + one blob: `with_write_type(['id', 'blob_data'])` with a 
batch that only has those columns; asserts one Parquet file (`write_cols == 
['id']`), blob file(s) (`write_cols == ['blob_data']`), row counts, and commit + 
read-back (the unwritten `name` column is null).
   2. Partial normal only: `with_write_type(['id', 'name'])` without blob 
columns in the batch; asserts no `.blob` files, Parquet `write_cols == ['id', 
'name']`, and read-back with the blob column null.
   3. Two blob columns, write one: schema with `blob1` and `blob2`, 
`with_write_type(['id', 'blob1'])`; asserts exactly one blob file and 
`write_cols == ['blob1']`.

