Thanks Yonghao.

I will update the PIP with these changes.

Best,
Jingsong

On Wed, Sep 24, 2025 at 4:08 PM Yonghao Fang <fangyonghao0...@gmail.com> wrote:
>
> Subject: [Proposal] Proposal for PyPaimon Blob Type Support with Arrow
>
> Hi all,
>
> Following a recent discussion, the proposal to put environment-related
> options in BlobDescriptor is not a good choice. These configurations, like
> credentials, are tied to the file system, not the individual data blob
> itself. A better approach is to store these environment-specific options in
> the catalog context options.
>
> Example usage:
>     table = catalog.get_table('database_name.table_name')
>     write_builder = table.new_batch_write_builder()
>     table_write = write_builder.new_write()
>
>     blob = Blob.from_http("http://path/to/file")
>     blob_descriptor = blob.get_descriptor()
>     blob_descriptors = [blob_descriptor]
>     array = pa.array(blob_descriptors, type=pa.binary())
>     # from_arrays needs names or a schema; 'blob' is an illustrative name
>     record_batch = pa.RecordBatch.from_arrays([array], names=['blob'])
>     table_write.write_arrow_batch(record_batch)
>
>     restored_blob = Blob.from_descriptor(array[0].as_py())
>     input_stream = restored_blob.new_input_stream()
>
> On Tue, Sep 23, 2025 at 15:26, Yonghao Fang <fangyonghao0...@gmail.com> wrote:
>
> > Subject: [Proposal] Proposal for PyPaimon Blob Type Support with Arrow
> >>>
> >>> Hi all,
> >>>
> >>> Following up on my previous proposal about Blob type support in Apache
> >>> Arrow, I'd like to refine my initial proposal to better balance
> >>> abstraction and usability.
> >>>
> >>> My initial proposal for `BlobMeta` exposed too many implementation
> >>> details, which is unnecessary.
> >>>
> >>> I propose a new, more abstract concept: the **BlobDescriptor**.
> >>>
> >>> The `BlobDescriptor` is a high-level object that encapsulates all the
> >>> necessary information to locate and access a blob, including:
> >>>
> >>> - **URI**: The storage location of the blob.
> >>> - **Length and Offset**: Specify the data range within the file.
> >>> - **Options**: A `map<string, string>` for storing additional
> >>>   configuration, such as authentication credentials or
> >>>   storage-specific parameters.
> >>>
> >>> The `BlobDescriptor` serves as an opaque reference. Users interact with
> >>> this object without needing to understand how its internal metadata is
> >>> structured or parsed.
> >>>
> >>> We can explicitly state in the documentation that a Paimon `Blob` type
> >>> in an Arrow array will never return raw binary data. Instead, it will
> >>> strictly return a serialized `BlobDescriptor`. This clearly
> >>> distinguishes the `Blob` type from the generic `BINARY` type,
> >>> preventing misuse.
> >>>
> >>> Example usage:
> >>>
> >>>     table = catalog.get_table('database_name.table_name')
> >>>     write_builder = table.new_batch_write_builder()
> >>>     table_write = write_builder.new_write()
> >>>
> >>>     blob = Blob.from_path("/path/to/file")
> >>>     blob_descriptor = blob.get_descriptor()
> >>>     blob_descriptors = [blob_descriptor]
> >>>     array = pa.array(blob_descriptors, type=pa.binary())
> >>>     # from_arrays needs names or a schema; 'blob' is an illustrative name
> >>>     record_batch = pa.RecordBatch.from_arrays([array], names=['blob'])
> >>>     table_write.write_arrow_batch(record_batch)
> >>>
> >>>     restored_blob = Blob.from_descriptor(array[0].as_py())
> >>>     input_stream = restored_blob.new_input_stream()
> >>>
> >>>
> >>> Looking forward to your thoughts.
> >>>
> >>> Best regards,
> >>>
> >>> Yonghao Fang
> >>
> >>
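
To make the "serialized BlobDescriptor" idea in the thread concrete, here is a minimal sketch of how a descriptor holding only a URI, offset and length might round-trip through bytes. It is an illustration under stated assumptions, not PyPaimon's implementation: the class body, the `serialize`/`deserialize` method names and the byte layout are all hypothetical. In line with the later message, environment-specific options such as credentials are left out of the descriptor and assumed to live in the catalog context.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout, NOT PyPaimon's actual format:
# 1-byte version | 4-byte URI length | URI (UTF-8) | 8-byte offset | 8-byte length
_VERSION = 1
_HEADER = "<BI"   # little-endian: version, URI byte length
_RANGE = "<qq"    # little-endian: offset, length


@dataclass(frozen=True)
class BlobDescriptor:
    """Opaque reference to a blob: where it lives and which byte range to read."""
    uri: str
    offset: int
    length: int

    def serialize(self) -> bytes:
        # Encode the URI first so the header can record its byte length.
        uri_bytes = self.uri.encode("utf-8")
        return (
            struct.pack(_HEADER, _VERSION, len(uri_bytes))
            + uri_bytes
            + struct.pack(_RANGE, self.offset, self.length)
        )

    @classmethod
    def deserialize(cls, data: bytes) -> "BlobDescriptor":
        version, uri_len = struct.unpack_from(_HEADER, data, 0)
        if version != _VERSION:
            raise ValueError(f"unsupported descriptor version: {version}")
        start = struct.calcsize(_HEADER)
        uri = data[start:start + uri_len].decode("utf-8")
        offset, length = struct.unpack_from(_RANGE, data, start + uri_len)
        return cls(uri, offset, length)


# Round trip: these bytes are what a binary Arrow column would carry.
descriptor = BlobDescriptor("http://path/to/file", offset=0, length=1024)
payload = descriptor.serialize()
restored = BlobDescriptor.deserialize(payload)
```

Because the payload is plain `bytes`, it fits a `pa.binary()` array as in the examples quoted above, and the reader only needs the descriptor plus the catalog's own options to open the blob.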
