>
> Subject: [Proposal] PyPaimon Blob Type Support with Arrow
>>
>> Hi all,
>>
>> Following up on my previous proposal about Blob type support in PyPaimon
>> with Apache Arrow, I'd like to refine the initial design to better
>> balance abstraction and usability.
>>
>> My initial `BlobMeta` proposal exposed implementation details that users
>> should not need to care about.
>>
>> In its place, I propose a more abstract concept: the **BlobDescriptor**.
>>
>> The `BlobDescriptor` is a high-level object that encapsulates everything
>> needed to locate and access a blob (see the sketch after this list):
>>
>> - **URI**: the storage location of the blob.
>> - **Length and Offset**: the byte range occupied by the blob within the
>>   file.
>> - **Options**: a `map<string, string>` for additional configuration, such
>>   as authentication credentials or storage-specific parameters.
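>>
>> To make this concrete, here is a minimal sketch of what the descriptor
>> could look like on the Python side. The dataclass layout and the JSON
>> encoding are my assumptions for illustration only, not a settled wire
>> format:
>>
>> import json
>> from dataclasses import dataclass, field
>>
>> @dataclass
>> class BlobDescriptor:
>>     # opaque handle that locates a blob; layout is illustrative only
>>     uri: str          # storage location, e.g. "s3://bucket/blobs/0001"
>>     offset: int       # byte offset of the blob within the file
>>     length: int       # byte length of the blob
>>     options: dict = field(default_factory=dict)  # e.g. credentials
>>
>>     def serialize(self) -> bytes:
>>         # JSON is just for illustration; the real encoding is open
>>         return json.dumps(self.__dict__).encode("utf-8")
>>
>>     @classmethod
>>     def deserialize(cls, data: bytes) -> "BlobDescriptor":
>>         return cls(**json.loads(data.decode("utf-8")))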
>>
>> The `BlobDescriptor` serves as an opaque reference. Users interact with
>> this
>> object without needing to understand how its internal metadata is
>> structured
>> or parsed.
>>
>> We can state explicitly in the documentation that a Paimon `Blob` column
>> in an Arrow array never returns raw binary data; it strictly returns a
>> serialized `BlobDescriptor`. This clearly distinguishes the `Blob` type
>> from the generic `BINARY` type and prevents misuse.
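>>
>> Under the hypothetical sketch above, consuming a `Blob` cell would then
>> look like this, where `blob_array` is a placeholder for a `Blob` column
>> and the bytes must be parsed as a descriptor, never treated as file
>> content:
>>
>> raw = blob_array[0].as_py()  # bytes of a serialized BlobDescriptor
>> descriptor = BlobDescriptor.deserialize(raw)
>> print(descriptor.uri, descriptor.offset, descriptor.length)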
>>
>> Example usage:
>>
>> import pyarrow as pa
>>
>> table = catalog.get_table('database_name.table_name')
>> write_builder = table.new_batch_write_builder()
>> table_write = write_builder.new_write()
>> table_commit = write_builder.new_commit()
>>
>> blob = Blob.from_path("/path/to/file")
>> blob_descriptor = blob.get_descriptor()
>> array = pa.array([blob_descriptor], type=pa.binary())
>> # from_arrays requires column names; 'blob_col' is a placeholder
>> record_batch = pa.RecordBatch.from_arrays([array], names=['blob_col'])
>> table_write.write_arrow_batch(record_batch)
>> table_commit.commit(table_write.prepare_commit())
>>
>> # round trip: the cell holds a serialized descriptor, not file bytes
>> restored_blob = Blob.from_descriptor(array[0].as_py())
>> input_stream = restored_blob.new_input_stream()
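>>
>> On the read side, restoring blobs from a scanned table could look like
>> the following. This assumes pypaimon's batch read API, the 'blob_col'
>> placeholder name from the write example, and that the returned input
>> stream supports read(); these are my assumptions, not settled API:
>>
>> read_builder = table.new_read_builder()
>> table_read = read_builder.new_read()
>> splits = read_builder.new_scan().plan().splits()
>> arrow_table = table_read.to_arrow(splits)
>>
>> # each cell of the blob column is a serialized BlobDescriptor
>> for raw in arrow_table.column('blob_col').to_pylist():
>>     blob = Blob.from_descriptor(raw)
>>     input_stream = blob.new_input_stream()
>>     data = input_stream.read()  # assumed stream interface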
>>
>>
>> Looking forward to your thoughts.
>>
>> Best regards,
>>
>> Yonghao Fang