steFaiz opened a new issue, #7881:
URL: https://github.com/apache/paimon/issues/7881

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Motivation
   
   Currently, the Data-Evolution Table supports updating or adding regular 
columns as well as BlobDescriptor columns. However, in practical AI scenarios, 
it is quite common to need to add or update blob columns. For example:
   * Image preprocessing: rotation, cropping, augmentation, etc.
   * Video preprocessing: audio removal, frame extraction, etc.
   * Audio preprocessing, and so on.
   
   At present, updating Raw Blob columns is not allowed. This restriction 
exists primarily because the DataEvolution Update operation involves copying 
old files. If the column is of Blob type, copying the original Blob would incur 
significant overhead. Therefore, this operation is currently prohibited.
   
   
   
   ### Solution
   
   This approach introduces a Placeholder Blob, similar to null. During 
writing, the corresponding length is set to -2 (currently, null uses -1). The 
basic principle is as follows:
   
   <img width="422" height="316" alt="Image" 
src="https://github.com/user-attachments/assets/5274be6f-c515-4a9b-a215-f713dd3a8c51"
 />
   
   1.  During MergeInto: Instead of reading the Target table's Blob column, a 
Placeholder Blob is uniformly inserted.
   2.  In BlobFormatWriter: When writing, Placeholder Blobs are written 
directly with length = -2.
   3.  Rewrite FileBunch read logic for Blobs: Files within the same bunch are 
divided into groups based on max_seq_id. Within each max_seq_id group, a 
concatenated reader is constructed. If a Placeholder is encountered during 
reading, the system falls back to the reader from the previous max_seq_id group.
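
The three steps above can be sketched roughly as follows. This is only a minimal in-memory illustration in Python, not Paimon code: `BlobFile`, `encode_length`, `read_bunch`, the `PLACEHOLDER` sentinel, and the `first_row_id` field are hypothetical names, and the sketch assumes a row missing from a group behaves like a Placeholder. Only the length values -1 (null) and -2 (Placeholder) come from the proposal.

```python
from dataclasses import dataclass
from itertools import groupby

NULL_LENGTH = -1         # from the proposal: null is written with length -1
PLACEHOLDER_LENGTH = -2  # from the proposal: Placeholder is written with length -2

# Hypothetical sentinel meaning "this blob was not rewritten by the update".
PLACEHOLDER = object()


def encode_length(value):
    """Length field a writer would emit for one blob value (illustrative)."""
    if value is None:
        return NULL_LENGTH
    if value is PLACEHOLDER:
        return PLACEHOLDER_LENGTH
    return len(value)


@dataclass
class BlobFile:
    """Hypothetical stand-in for one blob file inside a bunch."""
    max_seq_id: int
    first_row_id: int
    blobs: list  # each entry is bytes, None, or PLACEHOLDER


def read_bunch(files, row_count):
    """Resolve every row of a bunch, falling back to older groups on Placeholder."""
    # Split the bunch into groups by max_seq_id, newest group first.
    ordered = sorted(files, key=lambda f: f.max_seq_id, reverse=True)
    layers = []
    for _, group in groupby(ordered, key=lambda f: f.max_seq_id):
        # "Concatenated reader": lay the group's files end to end by row id.
        layer = {}
        for f in sorted(group, key=lambda f: f.first_row_id):
            for offset, blob in enumerate(f.blobs):
                layer[f.first_row_id + offset] = blob
        layers.append(layer)

    result = []
    for row_id in range(row_count):
        value = PLACEHOLDER
        for layer in layers:
            # Assumption: a row absent from a group is treated as a Placeholder.
            value = layer.get(row_id, PLACEHOLDER)
            if value is not PLACEHOLDER:
                break  # null (None) is a real value and stops the fallback
        if value is PLACEHOLDER:
            raise ValueError(f"row {row_id} has no concrete blob in any group")
        result.append(value)
    return result
```

For example, if an older group holds four rows and a newer group rewrites only the second row (the rest being Placeholders), `read_bunch` takes the second row from the newer group and the others from the older one. Note that null and Placeholder are deliberately distinct: a null in the newer group overrides the older blob, while a Placeholder defers to it.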
   
   ### Anything else?
   
   ## Alternatives
   An alternative approach is to directly store BlobDescriptors in BlobFile, as 
below:
   
   <img width="422" height="316" alt="Image" 
src="https://github.com/user-attachments/assets/f593a818-29d9-4c45-855e-81295c78d81d"
 />
   
   However, this approach is more complicated, especially for compaction: 
concurrent MergeInto and compaction operations would have to be coordinated 
to avoid dangling pointers.
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.