Re: [DISCUSS] PIP-35: Introduce Blob to store multimodal data

Yonghao Fang Sat, 18 Oct 2025 12:21:07 -0700

Subject: [Proposal] Proposal for PyPaimon Blob Type Support with Arrow

Hi all,


Following a recent discussion, I now think putting environment-related
options in the BlobDescriptor is not a good choice. These configurations,
such as credentials, are tied to the file system, not to the individual
blob. A better approach is to store these environment-specific options in
the catalog context options.

Example usage:
    table = catalog.get_table('database_name.table_name')
    write_builder = table.new_batch_write_builder()
    table_write = write_builder.new_write()

    blob = Blob.from_http("http://path/to/file")
    blob_descriptor = blob.get_descriptor()
    blob_descriptors = [blob_descriptor]
    array = pa.array(blob_descriptors, type=pa.binary())
    # from_arrays needs names or a schema; 'blob' is a placeholder column name
    record_batch = pa.RecordBatch.from_arrays([array], names=['blob'])
    table_write.write_arrow_batch(record_batch)

    restored_blob = Blob.from_descriptor(array[0].as_py())
    input_stream = restored_blob.new_input_stream()
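
To make the split concrete, here is a rough, stdlib-only sketch of a
descriptor that carries only location data (URI, offset, length), with
environment options held separately on the catalog side. The class name
`SketchBlobDescriptor`, the byte layout, and the option key are all
assumptions made for illustration; they are not the PyPaimon API or its
wire format.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchBlobDescriptor:
    """Hypothetical stand-in for BlobDescriptor: location only, no credentials."""

    uri: str
    offset: int
    length: int

    def serialize(self) -> bytes:
        # Assumed layout (little-endian): uint32 URI byte length, URI bytes,
        # int64 offset, int64 length.
        uri_bytes = self.uri.encode("utf-8")
        header = struct.pack("<I", len(uri_bytes))
        return header + uri_bytes + struct.pack("<qq", self.offset, self.length)

    @classmethod
    def deserialize(cls, data: bytes) -> "SketchBlobDescriptor":
        (uri_len,) = struct.unpack_from("<I", data, 0)
        uri = data[4:4 + uri_len].decode("utf-8")
        offset, length = struct.unpack_from("<qq", data, 4 + uri_len)
        return cls(uri, offset, length)


# Environment-specific settings live with the catalog, not with each blob.
# The key below is a made-up placeholder, not a real Paimon option.
catalog_options = {"fs.example.access-key": "<placeholder>"}

descriptor = SketchBlobDescriptor("http://path/to/file", 0, 1024)
payload = descriptor.serialize()          # bytes suitable for a binary column
restored = SketchBlobDescriptor.deserialize(payload)
```

With this shape, the serialized bytes stored in the table never contain
credentials, so the same data can be read from a different environment by
changing only the catalog options.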

On Tue, Sep 23, 2025 at 15:26, Yonghao Fang <[email protected]> wrote:

> Subject: [Proposal] Proposal for PyPaimon Blob Type Support with Arrow
>>>
>>> Hi all,
>>>
>>> Following up on my previous proposal about Blob type support in Apache
>>> Arrow, I'd like to refine my initial proposal to better balance
>>> abstraction and usability.
>>>
>>> My initial proposal for `BlobMeta` exposed too many implementation
>>> details, which is unnecessary.
>>>
>>> I propose a new, more abstract concept: the **BlobDescriptor**.
>>>
>>> The `BlobDescriptor` is a high-level object that encapsulates all the
>>> necessary information to locate and access a blob, including:
>>>
>>> - **URI**: The storage location of the blob.
>>> - **Length and Offset**: Specify the data range within the file.
>>> - **Options**: A `map<string, string>` for storing additional
>>>   configuration, such as authentication credentials or storage-specific
>>>   parameters.
>>>
>>> The `BlobDescriptor` serves as an opaque reference. Users interact with
>>> this object without needing to understand how its internal metadata is
>>> structured or parsed.
>>>
>>> We can explicitly state in the documentation that a Paimon `Blob` type
>>> in an Arrow array will never return raw binary data. Instead, it will
>>> strictly return a serialized `BlobDescriptor`. This clearly distinguishes
>>> the `Blob` type from the generic `BINARY` type, preventing misuse.
>>>
>>> Example usage:
>>>
>>>     table = catalog.get_table('database_name.table_name')
>>>     write_builder = table.new_batch_write_builder()
>>>     table_write = write_builder.new_write()
>>>
>>>     blob = Blob.from_path("/path/to/file")
>>>     blob_descriptor = blob.get_descriptor()
>>>     blob_descriptors = [blob_descriptor]
>>>     array = pa.array(blob_descriptors, type=pa.binary())
>>>     record_batch = pa.RecordBatch.from_arrays([array])
>>>     table_write.write_arrow_batch(record_batch)
>>>
>>>     restored_blob = Blob.from_descriptor(array[0].as_py())
>>>     input_stream = restored_blob.new_input_stream()
>>>
>>>
>>> Looking forward to your thoughts.
>>>
>>> Best regards,
>>>
>>> Yonghao Fang
>>
>>
