Thanks, Yonghao. I will add these updates to the PIP.

Best,
Jingsong

On Wed, Sep 24, 2025 at 4:08 PM Yonghao Fang <fangyonghao0...@gmail.com> wrote:
>
> Subject: [Proposal] Proposal for PyPaimon Blob Type Support with Arrow
>
> Hi all,
>
> Following a recent discussion, putting environment-related options in
> BlobDescriptor is not a good choice. These configurations, such as
> credentials, are tied to the file system, not to the individual data blob.
> A better approach is to store these environment-specific options in the
> catalog context options.
>
> Example usage:
> table = catalog.get_table('database_name.table_name')
> write_builder = table.new_batch_write_builder()
> table_write = write_builder.new_write()
>
> blob = Blob.from_http("http://path/to/file")
> blob_descriptor = blob.get_descriptor()
> blob_descriptors = [blob_descriptor]
> array = pa.array(blob_descriptors, type=pa.binary())
> record_batch = pa.RecordBatch.from_arrays([array])
> table_write.write_arrow_batch(record_batch)
>
> restored_blob = Blob.from_descriptor(array[0].as_py())
> input_stream = restored_blob.new_input_stream()
>
> On Tue, Sep 23, 2025 at 15:26, Yonghao Fang <fangyonghao0...@gmail.com> wrote:
>
> > Subject: [Proposal] Proposal for PyPaimon Blob Type Support with Arrow
> >>>
> >>> Hi all,
> >>>
> >>> Following up on my previous proposal about Blob type support in Apache
> >>> Arrow, I'd like to refine my initial proposal to better balance
> >>> abstraction and usability.
> >>>
> >>> My initial proposal for `BlobMeta` exposed too many implementation
> >>> details, which is unnecessary.
> >>>
> >>> I propose a new, more abstract concept: the **BlobDescriptor**.
> >>>
> >>> The `BlobDescriptor` is a high-level object that encapsulates all the
> >>> necessary information to locate and access a blob, including:
> >>>
> >>> - **URI**: The storage location of the blob.
> >>> - **Length and Offset**: Specify the data range within the file.
> >>> - **Options**: A `map<string, string>` for storing additional
> >>>   configuration, such as authentication credentials or
> >>>   storage-specific parameters.
> >>>
> >>> The `BlobDescriptor` serves as an opaque reference. Users interact with
> >>> this object without needing to understand how its internal metadata is
> >>> structured or parsed.
> >>>
> >>> We can explicitly state in the documentation that a Paimon `Blob` type
> >>> in an Arrow array will never return raw binary data. Instead, it will
> >>> strictly return a serialized `BlobDescriptor`. This clearly
> >>> distinguishes the `Blob` type from the generic `BINARY` type,
> >>> preventing misuse.
> >>>
> >>> Example usage:
> >>>
> >>> table = catalog.get_table('database_name.table_name')
> >>> write_builder = table.new_batch_write_builder()
> >>> table_write = write_builder.new_write()
> >>>
> >>> blob = Blob.from_path("/path/to/file")
> >>> blob_descriptor = blob.get_descriptor()
> >>> blob_descriptors = [blob_descriptor]
> >>> array = pa.array(blob_descriptors, type=pa.binary())
> >>> record_batch = pa.RecordBatch.from_arrays([array])
> >>> table_write.write_arrow_batch(record_batch)
> >>>
> >>> restored_blob = Blob.from_descriptor(array[0].as_py())
> >>> input_stream = restored_blob.new_input_stream()
> >>>
> >>>
> >>> Looking forward to your thoughts.
> >>>
> >>> Best regards,
> >>>
> >>> Yonghao Fang
> >>
> >>
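
For illustration only, the idea of a descriptor that travels through an Arrow binary column as opaque bytes could be sketched as below. This is a minimal sketch under stated assumptions, not PyPaimon's actual implementation: the class name `BlobDescriptorSketch`, its methods, and the byte layout are all hypothetical, since the thread does not specify a wire format. Following the Sep 24 revision, the sketch carries only the URI, offset, and length, and leaves environment options (such as credentials) out of the descriptor.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobDescriptorSketch:
    """Hypothetical stand-in for the proposed BlobDescriptor."""

    uri: str      # storage location of the blob
    offset: int   # start of the data range within the file
    length: int   # number of bytes in the data range

    def serialize(self) -> bytes:
        # Assumed layout (not from the proposal): 4-byte little-endian URI
        # length, the UTF-8 URI, then offset and length as 8-byte integers.
        uri_bytes = self.uri.encode("utf-8")
        header = struct.pack("<I", len(uri_bytes))
        trailer = struct.pack("<qq", self.offset, self.length)
        return header + uri_bytes + trailer

    @classmethod
    def deserialize(cls, data: bytes) -> "BlobDescriptorSketch":
        # Reverse of serialize(): read the URI length, the URI, then the range.
        (uri_len,) = struct.unpack_from("<I", data, 0)
        uri = data[4:4 + uri_len].decode("utf-8")
        offset, length = struct.unpack_from("<qq", data, 4 + uri_len)
        return cls(uri=uri, offset=offset, length=length)


# Round trip: these bytes are what a binary column would hold
# in place of the raw blob content.
descriptor = BlobDescriptorSketch(uri="http://path/to/file", offset=0, length=1024)
encoded = descriptor.serialize()
restored = BlobDescriptorSketch.deserialize(encoded)
```

Because the column holds only these descriptor bytes, a reader has to resolve the URI through the file system configured in the catalog context before it can open an input stream, which is where credentials would be applied.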