leaves12138 commented on PR #7837: URL: https://github.com/apache/paimon/pull/7837#issuecomment-4447464839
Thanks for the PR. The overall direction looks right to me: the Python `BlobViewStruct` wire format matches the Java side, and the write/read path for inline `blob-view-field` plus `blob-as-descriptor=true` is mostly aligned with the BlobView design. I found one blocker around the data-evolution read path though. `DataFileBatchReader` already resolves inline `blob-view-field` values to the actual blob payload when `blob-as-descriptor=false`. In the data-evolution path, `DataEvolutionSplitRead` then wraps the merged reader with `BlobDescriptorConvertReader`, and that reader scans the already-resolved bytes again with `BlobViewStruct.is_blob_view_struct()` / `deserialize()`. This means a perfectly valid blob payload whose first bytes happen to match the BlobView magic header can be parsed as a `BlobViewStruct` and fail, or even be incorrectly resolved as another view. The problematic chain is: - `DataFileBatchReader._convert_inline_blob_columns` resolves view fields before returning file batches. - `DataEvolutionSplitRead.create_reader` wraps the result with `BlobDescriptorConvertReader` whenever blob descriptor/view fields exist. - `BlobDescriptorConvertReader._convert_batch` attempts to interpret the final bytes as `BlobViewStruct` again. I reproduced this locally by writing a source blob payload starting with the BlobView version + magic bytes and then reading it through a downstream `blob-view-field`; the default read failed with `ValueError: Invalid BlobViewStruct data: too short` even though the source payload is just normal blob data. I think we should ensure inline blob values are converted exactly once. One possible fix is to keep conversion in `DataFileBatchReader` and not wrap data-evolution readers with `BlobDescriptorConvertReader` for fields that have already been converted; alternatively, disable the file-reader-side conversion in the data-evolution path and do conversion only after merge/filter. One more compatibility point: Java validation requires fields listed in `blob-descriptor-field` / `blob-view-field` to also be listed in `blob-field`. The Python schema validation currently only checks that those names are BLOB fields and do not overlap. If this relaxation is intentional because Python treats all `large_binary` columns as BLOBs, please document it; otherwise it would be better to align the option validation with Java to avoid creating schemas that Java/Flink would reject. Local validation I ran: - `python3.8 -m pytest -q paimon-python/pypaimon/tests/blob_table_test.py paimon-python/pypaimon/tests/blob_test.py` -> `87 passed, 1 skipped` - targeted BlobView tests -> passed - `flake8` on changed Python files -> passed - `compileall` on changed Python files -> passed - `git diff --check` -> passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
