>
> Subject: [Proposal] PyPaimon Blob Type Support with Arrow
>>
>> Hi all,
>>
>> Following up on my previous proposal about Blob type support in PyPaimon
>> with Apache Arrow, I'd like to refine the initial design to better
>> balance abstraction and usability.
>>
>> My initial `BlobMeta` proposal exposed implementation details that users
>> should not need to care about.
>>
>> In its place, I propose a more abstract concept: the **BlobDescriptor**.
>>
>> The `BlobDescriptor` is a high-level object that encapsulates everything
>> needed to locate and access a blob (see the sketch after this list):
>>
>> - **URI**: the storage location of the blob.
>> - **Length and Offset**: the byte range occupied by the blob within the
>>   file.
>> - **Options**: a `map<string, string>` for additional configuration, such
>>   as authentication credentials or storage-specific parameters.
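>>
>> To make this concrete, here is a minimal sketch of what the descriptor
>> could look like on the Python side. The dataclass layout and the JSON
>> encoding are my assumptions for illustration only, not a settled wire
>> format:
>>
>> import json
>> from dataclasses import dataclass, field
>>
>> @dataclass
>> class BlobDescriptor:
>>     # opaque handle that locates a blob; layout is illustrative only
>>     uri: str          # storage location, e.g. "s3://bucket/blobs/0001"
>>     offset: int       # byte offset of the blob within the file
>>     length: int       # byte length of the blob
>>     options: dict = field(default_factory=dict)  # e.g. credentials
>>
>>     def serialize(self) -> bytes:
>>         # JSON is just for illustration; the real encoding is open
>>         return json.dumps(self.__dict__).encode("utf-8")
>>
>>     @classmethod
>>     def deserialize(cls, data: bytes) -> "BlobDescriptor":
>>         return cls(**json.loads(data.decode("utf-8")))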
>>
>> The `BlobDescriptor` serves as an opaque reference. Users interact with
>> this
>> object without needing to understand how its internal metadata is
>> structured
>> or parsed.
>>
>> We can state explicitly in the documentation that a Paimon `Blob` column
>> in an Arrow array never returns raw binary data; it strictly returns a
>> serialized `BlobDescriptor`. This clearly distinguishes the `Blob` type
>> from the generic `BINARY` type and prevents misuse.
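>>
>> Under the hypothetical sketch above, consuming a `Blob` cell would then
>> look like this, where `blob_array` is a placeholder for a `Blob` column
>> and the bytes must be parsed as a descriptor, never treated as file
>> content:
>>
>> raw = blob_array[0].as_py()  # bytes of a serialized BlobDescriptor
>> descriptor = BlobDescriptor.deserialize(raw)
>> print(descriptor.uri, descriptor.offset, descriptor.length)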
>>
>> Example usage:
>>
>> import pyarrow as pa
>>
>> table = catalog.get_table('database_name.table_name')
>> write_builder = table.new_batch_write_builder()
>> table_write = write_builder.new_write()
>> table_commit = write_builder.new_commit()
>>
>> blob = Blob.from_path("/path/to/file")
>> blob_descriptor = blob.get_descriptor()
>> array = pa.array([blob_descriptor], type=pa.binary())
>> # from_arrays requires column names; 'blob_col' is a placeholder
>> record_batch = pa.RecordBatch.from_arrays([array], names=['blob_col'])
>> table_write.write_arrow_batch(record_batch)
>> table_commit.commit(table_write.prepare_commit())
>>
>> # round trip: the cell holds a serialized descriptor, not file bytes
>> restored_blob = Blob.from_descriptor(array[0].as_py())
>> input_stream = restored_blob.new_input_stream()
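>>
>> On the read side, restoring blobs from a scanned table could look like
>> the following. This assumes pypaimon's batch read API, the 'blob_col'
>> placeholder name from the write example, and that the returned input
>> stream supports read(); these are my assumptions, not settled API:
>>
>> read_builder = table.new_read_builder()
>> table_read = read_builder.new_read()
>> splits = read_builder.new_scan().plan().splits()
>> arrow_table = table_read.to_arrow(splits)
>>
>> # each cell of the blob column is a serialized BlobDescriptor
>> for raw in arrow_table.column('blob_col').to_pylist():
>>     blob = Blob.from_descriptor(raw)
>>     input_stream = blob.new_input_stream()
>>     data = input_stream.read()  # assumed stream interface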
>>
>>
>> Looking forward to your thoughts.
>>
>> Best regards,
>>
>> Yonghao Fang