Subject: [Proposal] Proposal for PyPaimon Blob Type Support with Arrow
Hi all,
Following a recent discussion, putting environment-related options in
BlobDescriptor turns out not to be a good choice. Configurations such as
credentials are tied to the file system, not to the individual blob. A
better approach is to store these environment-specific options in the
catalog context options.
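
To make the split concrete, here is a minimal sketch of how a reader
could resolve environment options from the catalog context by URI
scheme, so the descriptor itself carries no credentials. BlobRef,
options_for and the '<scheme>.<name>' key convention are assumptions
made for illustration only, not the PyPaimon API.

```python
from dataclasses import dataclass
from typing import Dict, Mapping
from urllib.parse import urlparse


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical stand-in for a descriptor: location only, no credentials."""
    uri: str
    offset: int
    length: int


def options_for(ref: BlobRef, catalog_options: Mapping[str, str]) -> Dict[str, str]:
    """Select the catalog options that apply to the blob's URI scheme.

    Assumes, for illustration only, that options are keyed as
    '<scheme>.<name>', for example 's3.access-key'.
    """
    scheme = urlparse(ref.uri).scheme or "file"
    prefix = scheme + "."
    return {
        key[len(prefix):]: value
        for key, value in catalog_options.items()
        if key.startswith(prefix)
    }


# Illustrative catalog context options; all values are placeholders.
catalog_options = {
    "warehouse": "s3://bucket/warehouse",
    "s3.access-key": "EXAMPLE-ACCESS-KEY",
    "s3.secret-key": "EXAMPLE-SECRET-KEY",
    "http.timeout": "30",
}

ref = BlobRef(uri="s3://bucket/data/file.bin", offset=0, length=1024)
resolved = options_for(ref, catalog_options)
```

With this shape, rotating a credential means updating the catalog
options once instead of rewriting every stored descriptor.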
Example usage:
import pyarrow as pa

table = catalog.get_table('database_name.table_name')
write_builder = table.new_batch_write_builder()
table_write = write_builder.new_write()

# The descriptor carries only the blob's location; credentials and other
# environment options come from the catalog context options.
blob = Blob.from_http("http://path/to/file")
blob_descriptor = blob.get_descriptor()
blob_descriptors = [blob_descriptor]
array = pa.array(blob_descriptors, type=pa.binary())
# from_arrays needs column names or a schema; 'blob' is a placeholder name.
record_batch = pa.RecordBatch.from_arrays([array], names=['blob'])
table_write.write_arrow_batch(record_batch)

restored_blob = Blob.from_descriptor(array[0].as_py())
input_stream = restored_blob.new_input_stream()
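
For reference, once the options map is dropped, the fields left in the
descriptor from the quoted proposal are the URI, the offset and the
length. Below is a rough sketch of how such a descriptor could round-trip
through the bytes stored in a binary Arrow column. The DescriptorSketch
class and its byte layout are hypothetical and chosen only for
illustration; the actual serialized format is not specified in this
thread.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class DescriptorSketch:
    """Hypothetical descriptor holding only location information."""
    uri: str
    offset: int
    length: int

    def serialize(self) -> bytes:
        # Assumed layout (little-endian): uint32 URI byte length,
        # UTF-8 URI bytes, int64 offset, int64 length.
        uri_bytes = self.uri.encode("utf-8")
        header = struct.pack("<I", len(uri_bytes))
        tail = struct.pack("<qq", self.offset, self.length)
        return header + uri_bytes + tail

    @classmethod
    def deserialize(cls, data: bytes) -> "DescriptorSketch":
        (uri_len,) = struct.unpack_from("<I", data, 0)
        uri = data[4:4 + uri_len].decode("utf-8")
        offset, length = struct.unpack_from("<qq", data, 4 + uri_len)
        return cls(uri, offset, length)


original = DescriptorSketch("http://path/to/file", offset=0, length=4096)
payload = original.serialize()  # bytes suitable for a pa.binary() cell
restored = DescriptorSketch.deserialize(payload)
```

The payload is a plain bytes value, which is what the pa.binary() column
in the example above holds per row.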
Yonghao Fang <[email protected]> wrote on Tue, Sep 23, 2025 at 15:26:
> Subject: [Proposal] Proposal for PyPaimon Blob Type Support with Arrow
>>>
>>> Hi all,
>>>
>>> Following up on my previous proposal about Blob type support in Apache
>>> Arrow, I'd like to refine my initial proposal to better balance
>>> abstraction and usability.
>>>
>>> My initial proposal for `BlobMeta` exposed too many implementation
>>> details, which is unnecessary.
>>>
>>> I propose a new, more abstract concept: the **BlobDescriptor**.
>>>
>>> The `BlobDescriptor` is a high-level object that encapsulates all the
>>> necessary information to locate and access a blob, including:
>>>
>>> - **URI**: The storage location of the blob.
>>> - **Length and Offset**: Specify the data range within the file.
>>> - **Options**: A `map<string, string>` for storing additional
>>>   configuration, such as authentication credentials or
>>>   storage-specific parameters.
>>>
>>> The `BlobDescriptor` serves as an opaque reference. Users interact with
>>> this object without needing to understand how its internal metadata is
>>> structured or parsed.
>>>
>>> We can explicitly state in the documentation that a Paimon `Blob` type
>>> in an Arrow array will never return raw binary data. Instead, it will
>>> strictly return a serialized `BlobDescriptor`. This clearly distinguishes
>>> the `Blob` type from the generic `BINARY` type, preventing misuse.
>>>
>>> Example usage:
>>>
>>> table = catalog.get_table('database_name.table_name')
>>> write_builder = table.new_batch_write_builder()
>>> table_write = write_builder.new_write()
>>>
>>> blob = Blob.from_path("/path/to/file")
>>> blob_descriptor = blob.get_descriptor()
>>> blob_descriptors = [blob_descriptor]
>>> array = pa.array(blob_descriptors, type=pa.binary())
>>> # from_arrays needs column names or a schema; 'blob' is a placeholder.
>>> record_batch = pa.RecordBatch.from_arrays([array], names=['blob'])
>>> table_write.write_arrow_batch(record_batch)
>>>
>>> restored_blob = Blob.from_descriptor(array[0].as_py())
>>> input_stream = restored_blob.new_input_stream()
>>>
>>>
>>> Looking forward to your thoughts.
>>>
>>> Best regards,
>>>
>>> Yonghao Fang