steFaiz opened a new issue, #7881: URL: https://github.com/apache/paimon/issues/7881
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.

### Motivation

Currently, the Data-Evolution Table supports updating or adding regular columns as well as BlobDescriptor columns. However, in practical AI scenarios, it is quite common to need to add or update blob columns. For example:

* Image preprocessing: rotation, cropping, augmentation, etc.
* Video preprocessing: audio removal, frame extraction, etc.
* Audio preprocessing, and so on.

At present, updating Raw Blob columns is not allowed. This restriction exists primarily because the DataEvolution Update operation involves copying old files. If the column is of Blob type, copying the original Blob would incur significant overhead. Therefore, this operation is currently prohibited.

### Solution

This approach introduces a Placeholder Blob, similar to null. During writing, the corresponding length is set to -2 (currently, null uses -1). The basic principle is as follows:

<img width="422" height="316" alt="Image" src="https://github.com/user-attachments/assets/5274be6f-c515-4a9b-a215-f713dd3a8c51" />

1. During MergeInto: Instead of reading the Target table's Blob column, a Placeholder Blob is uniformly inserted.
2. In BlobFormatWriter: When writing, Placeholder Blobs are written directly with length = -2.
3. Rewrite FileBunch read logic for Blobs: Files within the same bunch are divided into groups based on max_seq_id. Within each max_seq_id group, a concatenated reader is constructed. If a Placeholder is encountered during reading, the system falls back to the reader from the previous max_seq_id group.

### Anything else?

## Alternatives

An alternative approach is to store BlobDescriptors directly in the BlobFile, as shown below:

<img width="422" height="316" alt="Image" src="https://github.com/user-attachments/assets/f593a818-29d9-4c45-855e-81295c78d81d" />

But this approach is considered more complicated, especially for compaction.
We should take care of concurrent MergeInto and compaction operations to avoid dangling-pointer issues.

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
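
To make the Placeholder fallback from the Solution section more concrete, here is a minimal Python sketch of the idea. It is not Paimon code. The length values -1 and -2 come from the proposal; the names `PLACEHOLDER`, `encode_length`, and `resolve_blobs` are hypothetical. It also assumes that a larger max_seq_id means a newer write, that every group covers the same rows in the same order, and that the fallback may chain through several older groups.

```python
from typing import Dict, List, Optional

NULL_LENGTH = -1         # from the issue: length written for a null blob
PLACEHOLDER_LENGTH = -2  # from the issue: length written for a Placeholder Blob


class _Placeholder:
    """Hypothetical sentinel type standing in for a Placeholder Blob."""

    def __repr__(self) -> str:
        return "PLACEHOLDER"


PLACEHOLDER = _Placeholder()


def encode_length(value) -> int:
    """Length field a writer would emit for one blob value (step 2)."""
    if value is None:
        return NULL_LENGTH
    if value is PLACEHOLDER:
        return PLACEHOLDER_LENGTH
    return len(value)


def resolve_blobs(groups: Dict[int, List[object]]) -> List[Optional[bytes]]:
    """Resolve each row by walking groups from newest to oldest (step 3).

    `groups` maps a max_seq_id to the values its concatenated reader yields;
    every list is assumed to cover the same rows in the same order.
    """
    if not groups:
        return []
    newest_first = sorted(groups, reverse=True)
    row_count = len(groups[newest_first[0]])
    resolved = []
    for row in range(row_count):
        for seq_id in newest_first:
            value = groups[seq_id][row]
            if value is not PLACEHOLDER:
                # First non-placeholder value wins; null (None) is real data.
                resolved.append(value)
                break
        else:
            # No group holds real data for this row.
            raise ValueError(f"row {row} is a placeholder in every group")
    return resolved


# Hypothetical data: three writes against the same three rows.
groups = {
    1: [b"img-a", b"img-b", None],                    # original insert
    2: [PLACEHOLDER, b"img-b-rotated", PLACEHOLDER],  # MergeInto updating row 1
    3: [b"img-a-cropped", PLACEHOLDER, PLACEHOLDER],  # MergeInto updating row 0
}
resolved = resolve_blobs(groups)
# resolved == [b"img-a-cropped", b"img-b-rotated", None]
```

In this sketch a null blob is treated as real data and stops the fallback, while a Placeholder always defers to an older group. Whether the real reader should raise an error when no older group exists is an open design question that the proposal does not address.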
