leaves12138 commented on PR #7837:
URL: https://github.com/apache/paimon/pull/7837#issuecomment-4447464839

   Thanks for the PR. The overall direction looks right to me: the Python 
`BlobViewStruct` wire format matches the Java side, and the write/read path for 
inline `blob-view-field` plus `blob-as-descriptor=true` is mostly aligned with 
the BlobView design.
   
   I found one blocker around the data-evolution read path though.
   
   `DataFileBatchReader` already resolves inline `blob-view-field` values to 
the actual blob payload when `blob-as-descriptor=false`. In the data-evolution 
path, `DataEvolutionSplitRead` then wraps the merged reader with 
`BlobDescriptorConvertReader`, and that reader scans the already-resolved bytes 
again with `BlobViewStruct.is_blob_view_struct()` / `deserialize()`. This means 
a perfectly valid blob payload whose first bytes happen to match the BlobView 
magic header can be parsed as a `BlobViewStruct` and fail, or even be 
incorrectly resolved as another view.
   
   The problematic chain is:
   
   - `DataFileBatchReader._convert_inline_blob_columns` resolves view fields 
before returning file batches.
   - `DataEvolutionSplitRead.create_reader` wraps the result with 
`BlobDescriptorConvertReader` whenever blob descriptor/view fields exist.
   - `BlobDescriptorConvertReader._convert_batch` attempts to interpret the 
final bytes as `BlobViewStruct` again.
   
   I reproduced this locally by writing a source blob payload starting with the 
BlobView version + magic bytes and then reading it through a downstream 
`blob-view-field`; the default read failed with `ValueError: Invalid 
BlobViewStruct data: too short` even though the source payload is just normal 
blob data.
   
   I think we should ensure inline blob values are converted exactly once. One 
possible fix is to keep conversion in `DataFileBatchReader` and not wrap 
data-evolution readers with `BlobDescriptorConvertReader` for fields that have 
already been converted; alternatively, disable the file-reader-side conversion 
in the data-evolution path and do conversion only after merge/filter.
   
   One more compatibility point: Java validation requires fields listed in 
`blob-descriptor-field` / `blob-view-field` to also be listed in `blob-field`. 
The Python schema validation currently only checks that those names are BLOB 
fields and do not overlap. If this relaxation is intentional because Python 
treats all `large_binary` columns as BLOBs, please document it; otherwise it 
would be better to align the option validation with Java to avoid creating 
schemas that Java/Flink would reject.
   
   Local validation I ran:
   
   - `python3.8 -m pytest -q paimon-python/pypaimon/tests/blob_table_test.py 
paimon-python/pypaimon/tests/blob_test.py` -> `87 passed, 1 skipped`
   - targeted BlobView tests -> passed
   - `flake8` on changed Python files -> passed
   - `compileall` on changed Python files -> passed
   - `git diff --check` -> passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to