XiaoHongbo-Hope opened a new pull request, #6578:
URL: https://github.com/apache/paimon/pull/6578

   [python] Support blob.target-file-size (target-file-size) rolling for blob 
tables
   
   ## Purpose
   
   Currently, Python Paimon blob tables do not support file rolling based on 
`target-file-size` in `blob-as-descriptor=true` mode.
   
   ## This PR
   
   - **Supports `target-file-size` rolling in `blob-as-descriptor=true` mode**: 
Implements row-by-row writing with actual file size checking, aligned with 
Java's `RollingBlobFileWriter` architecture
   - **Fixes `blob.target-file-size` configuration**: Ensures 
`blob.target-file-size` is respected in both descriptor and non-descriptor 
modes instead of being ignored
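
For reviewers, the two points above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `resolve_blob_target_size`, `RollingBlobSketch`, and `DEFAULT_TARGET_FILE_SIZE` are hypothetical names, option values are assumed to be already parsed into bytes, and checking the size after each row (rather than before) is an assumption.

```python
import io
from typing import Dict, List, Optional

# Hypothetical default for illustration; the real default is not stated in this PR.
DEFAULT_TARGET_FILE_SIZE = 128 * 1024 * 1024


def resolve_blob_target_size(options: Dict[str, int]) -> int:
    """Prefer blob.target-file-size, fall back to target-file-size, then a default."""
    for key in ("blob.target-file-size", "target-file-size"):
        if key in options:
            return options[key]
    return DEFAULT_TARGET_FILE_SIZE


class RollingBlobSketch:
    """Writes rows one at a time and rolls once the current file reaches the target size."""

    def __init__(self, options: Dict[str, int]) -> None:
        self.target_size = resolve_blob_target_size(options)
        self.closed_files: List[bytes] = []
        self._current: Optional[io.BytesIO] = None

    def write_row(self, payload: bytes) -> None:
        if self._current is None:
            self._current = io.BytesIO()
        self._current.write(payload)
        # Check the actual number of bytes written, not an estimate.
        if self._current.tell() >= self.target_size:
            self._roll()

    def _roll(self) -> None:
        # Close the current in-memory "file" and start a new one on the next write.
        self.closed_files.append(self._current.getvalue())
        self._current = None

    def close(self) -> List[bytes]:
        # Flush a partially filled file, if any.
        if self._current is not None and self._current.tell() > 0:
            self._roll()
        return self.closed_files
```

With a 10-byte target and 4-byte rows, the sketch rolls after the third row, so five rows produce two files.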
   
   ## Tests
   
   - `blob_table_test.py`
   - `RollingBlobFileWriterTest`, used to check blob rolling in Java
   
   <!-- CURSOR_SUMMARY -->
   ---
   
   > [!NOTE]
   > Adds a Python blob writer that rolls by 
`blob.target-file-size`/`target-file-size`, uses `data-{uuid}-{count}.blob` 
naming, and maintains per-row sequence numbers; updates options and tests, and 
aligns Java tests.
   > 
   > - **Python Writer**:
   >   - **`BlobWriter`/`BlobFileWriter`**: New rolling logic for blobs.
   >     - Uses `blob.target-file-size` (fallback to `target-file-size`).
   >     - Writes row-by-row in descriptor mode; batch split in non-descriptor 
mode.
   >     - File naming `data-{uuid}-{count}.blob` with shared UUID across rolls.
   >     - Ensures per-row sequence number increments; accurate min/max seq in 
metadata.
   >     - Stats schema honors actual blob column name.
   > - **Core Options**:
   >   - Add `CoreOptions.TARGET_FILE_SIZE`, `BLOB_TARGET_FILE_SIZE` and 
getters.
   >   - `DataWriter` now derives target size from options.
   > - **Tests (Python)**:
   >   - Add rolling, filename format, sequence number, stats-schema, and 
blob-target-size coverage for both descriptor and non-descriptor modes.
   > - **Tests (Java)**:
   >   - Update `RollingBlobFileWriterTest` to use `data-` prefix and add tests 
for shared-UUID naming, sequence numbers, custom blob column stats, and blob 
target size.
   <!-- /CURSOR_SUMMARY -->
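
The `data-{uuid}-{count}.blob` naming and per-row sequence numbers described in the summary above could look roughly like the sketch below. It is only an illustration: `NamingSketch`, `FileMeta`, and the fixed `rows_per_file` rolling trigger are hypothetical stand-ins for the real size-based writer, and starting the file count at 0 is an assumption.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass
class FileMeta:
    name: str
    min_sequence_number: int
    max_sequence_number: int
    row_count: int


class NamingSketch:
    """Tracks file names and sequence-number ranges across rolls."""

    def __init__(self, rows_per_file: int, start_sequence: int = 0) -> None:
        self.file_uuid = uuid.uuid4()  # one UUID shared by every rolled file
        self.rows_per_file = rows_per_file  # stand-in for the size-based trigger
        self.next_sequence = start_sequence
        self.file_count = 0  # assumption: the count starts at 0
        self.metas: List[FileMeta] = []
        self._pending: List[int] = []  # sequence numbers in the current file

    def write_row(self) -> None:
        # Each row gets its own sequence number.
        self._pending.append(self.next_sequence)
        self.next_sequence += 1
        if len(self._pending) >= self.rows_per_file:
            self._roll()

    def _roll(self) -> None:
        name = f"data-{self.file_uuid}-{self.file_count}.blob"
        # Min/max come from the rows actually written to this file.
        self.metas.append(
            FileMeta(name, self._pending[0], self._pending[-1], len(self._pending))
        )
        self.file_count += 1
        self._pending = []

    def close(self) -> List[FileMeta]:
        if self._pending:
            self._roll()
        return self.metas
```

Writing five rows with `rows_per_file=2` yields three files whose names differ only in the trailing count and whose sequence ranges are contiguous and non-overlapping.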

