Thanks Guoxing for your suggestion. Now I have introduced the Blob interface:
/**
 * Blob interface, providing bytes and input stream methods.
 *
 * @since 1.4.0
 */
@Public
public interface Blob {

    byte[] toBytes();

    SeekableInputStream newInputStream() throws IOException;
}

And you can see the read and write example in the PIP.

Best,
Jingsong

---------- Forwarded message ---------
From: guoxing wgx <guoxing....@gmail.com>
Date: Tue, Sep 16, 2025 at 7:47 PM
Subject: Re: [DISCUSS] PIP-35: Introduce Blob to store multimodal data
To: Jingsong Li <jingsongl...@gmail.com>

Following MySQL's BLOB field design, can Paimon also support streaming
write capabilities for BLOB fields?

MySQL Large Object Storage

1. BINARY vs BLOB

MySQL supports two binary data types: BINARY and BLOB.

- BINARY is a fixed-length binary string type, similar to CHAR, but it
stores raw bytes instead of characters. It is suitable for small,
fixed-size binary data.
- BLOB (Binary Large Object) is a variable-length type designed to
store large amounts of binary data such as images, audio, video,
documents, and other file types.

Note: Currently, Apache Paimon only supports the BINARY type and does
not have a dedicated BLOB type with streaming I/O capabilities.

2. Operation Interfaces

Input Streams (Writing Data)

When inserting or updating BLOB data, MySQL provides several methods
through the JDBC API:

- setBinaryStream(int index, InputStream x, int length): Writes binary
data from an input stream into a BLOB field. This method is recommended
for streaming large files, as it avoids loading the entire data into
memory.
- setBlob(int index, InputStream inputStream) (available since JDBC
4.0): A more modern approach that writes BLOB data using an input
stream without requiring the length to be specified upfront. This
simplifies handling dynamically sized data.
- setBytes(int index, byte[] bytes): Directly writes a byte array to
the BLOB field. This is appropriate only for small files (e.g., less
than 1MB), as it can lead to high memory consumption and a potential
OutOfMemoryError (OOM) for larger data.

Output Streams (Reading Data)

When retrieving BLOB data from a result set, streaming access is
supported to prevent memory issues:

- getBinaryStream(String columnName): Reads the BLOB value as an input
stream, enabling chunked reading of large files. This is the
recommended way to handle large binary objects and avoid OOM.
- getBinaryStream(int index): Similar to the above method, but accesses
the column by its numeric index instead of its name. It is useful when
the column order is known and stable.

Large Object Handling (Blob)

In addition to direct stream access, MySQL allows working with the
java.sql.Blob interface for more advanced operations:

- ResultSet.getBlob(String columnName): Retrieves a java.sql.Blob
object from the result set, which provides additional methods for
manipulation.
- Blob.getBinaryStream(): Returns an input stream from the Blob object,
typically used in conjunction with ResultSet.getBlob() to enable lazy
or on-demand reading.
- Blob.length(): Returns the size (in bytes) of the BLOB data. This is
useful for allocating buffers, determining file size, or managing
partial reads.

Byte Array Access

- ResultSet.getBytes(String columnName): Reads the entire BLOB content
directly into a byte array. While convenient for small data, this
method should be avoided for large files, as it may cause an
OutOfMemoryError due to excessive memory usage.

This completes the description of MySQL's BLOB handling mechanisms,
focusing solely on factual presentation without additional analysis or
recommendations.
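For readers mapping the JDBC patterns above onto the proposed Paimon
interface, here is a minimal sketch of the two read paths. This is only
my reading of the interface as quoted in this thread: the
org.apache.paimon.fs.SeekableInputStream import, the class and method
names, and the 8 KB chunk size are assumptions, not part of the PIP.

import java.io.IOException;
import java.io.OutputStream;

import org.apache.paimon.fs.SeekableInputStream;

// Assumes the Blob interface quoted above is on the classpath.
public final class BlobIoSketch {

    // Streaming read, the analogue of JDBC's getBinaryStream(): only one
    // 8 KB buffer is resident in memory, regardless of the blob's size.
    static void copyBlob(Blob blob, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        try (SeekableInputStream in = blob.newInputStream()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    // Full materialization, the analogue of JDBC's getBytes(): convenient,
    // but only safe for small blobs (e.g., below 1MB), for the same OOM
    // reasons described above.
    static byte[] readSmallBlob(Blob blob) {
        return blob.toBytes();
    }
}

The point of the SeekableInputStream variant is exactly the one Guoxing
makes for JDBC: the whole value never has to fit in memory at once.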
Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Sep 16, 2025 at 19:30:
>
> From Guoxing in another thread:
>
> Following MySQL's BLOB field design, can Paimon also support streaming
> write capabilities for BLOB fields?
>
> MySQL Large Object Storage
>
> 1. BINARY vs BLOB
>
> *Note: MySQL supports both BINARY and BLOB types, whereas Paimon
> currently only supports BINARY*
>
> - BINARY: Fixed-length binary string type, similar to CHAR, but stores
>   bytes instead of characters.
> - BLOB: Variable-length binary large object type, used to store large
>   amounts of binary data (e.g., images, audio, files).
> ------------------------------
> 2. Operation Interfaces
>
> Input Streams (Writing Data) [Statement methods]
> - setBinaryStream(int index, InputStream x, int length): Writes binary
>   stream data into a BLOB field; used for inserting or updating BLOB
>   data. Recommended for streaming writes.
> - setBlob(int index, InputStream inputStream): Writes BLOB data using
>   an input stream (JDBC 4.0+). A more modern approach that does not
>   require specifying the length.
> - setBytes(int index, byte[] bytes): Directly writes a byte array.
>   Suitable only for small files (<1MB); be cautious about memory usage.
>
> Output Streams (Reading Data) [ResultSet methods]
> - getBinaryStream(String columnName): Reads BLOB data as an input
>   stream. Recommended for streaming large files to avoid OOM.
> - getBinaryStream(int index): Same as above, but accesses the column
>   by index. Equivalent to using the column name; useful when the
>   column order is known.
>
> Large Object Handling (Blob)
> - ResultSet.getBlob(String columnName): Retrieves a java.sql.Blob
>   object, which provides additional methods for manipulation.
> - Blob.getBinaryStream(): Gets an input stream from the Blob object.
>   Used in conjunction with ResultSet.getBlob().
> - Blob.length(): Returns the size (length) of the BLOB data. Useful
>   for determining file size or allocating buffers.
>
> Byte Array Access
> - ResultSet.getBytes(String columnName): Reads the entire BLOB
>   directly into a byte array. Only suitable for small files, as large
>   files may cause OutOfMemoryError (OOM).
> ------------------------------
>
> This comparison highlights that MySQL provides robust streaming I/O
> support for BLOBs, enabling efficient handling of large binary objects
> without loading them entirely into memory, a capability that could be
> valuable to implement in Paimon for better multimodal data management.
>
> On Tue, Sep 16, 2025 at 3:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > I want to start a discussion about blob files.
> >
> > Multimodal data storage needs to support multimedia files, including
> > text, images, audio, video, embedding vectors, etc. Paimon needs to
> > meet the demand for multimodal data entering the lake, and achieve
> > unified storage and efficient management of multimodal data and
> > structured data.
> >
> > Most multimodal files are actually not large, around 1MB or even
> > smaller, but there are also relatively large multimodal files, such
> > as 10GB+ files, which pose storage challenges for us.
> >
> > Consider two ways:
> >
> > 1. Multimodal data can be directly stored in column files, such as
> > Parquet or Lance files.
> > The biggest problem with this solution is that it brings challenges
> > to the file format: to avoid OOM on read and write, the format needs
> > a streaming API so that it never has to load an entire multimodal
> > value into memory. In addition, the extra fields accompanying
> > multimodal data may undergo frequent changes, additions, or even
> > deletions; if these changes require the multimodal files to
> > participate in reading and writing as well, the cost is very high.
> >
> > 2. Multimodal data is stored on object storage, and Parquet
> > references these files through pointers. The downside of this
> > approach is that it cannot directly manage the multimodal data and
> > may result in a large number of small files, which cause a
> > significant amount of file IO during use, leading to decreased
> > performance and increased costs.
> >
> > We should consider a new way to satisfy this requirement: a
> > high-performance architecture specifically designed for mixed
> > scenarios of massive small and large multimodal files, achieving
> > high-throughput writing and low-latency reading and meeting the
> > stringent performance requirements of AI, big data, and other
> > businesses.
> >
> > A more intuitive solution is: keep multimodal storage and structured
> > storage independent, manage the multimodal storage separately,
> > introduce a bin file mechanism that packs multiple multimodal values
> > into a single file, and let Parquet still reference the multimodal
> > data through pointers.
> >
> > What do you think?
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-35%3A+Introduce+Blob+to+store+multimodal+data
> >
> > Best,
> > Jingsong
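As an aside on the "bin file + pointer" idea in the quoted proposal,
one conceivable shape for such a pointer is sketched below. This is
purely illustrative: the PIP quoted above does not specify any of these
names or fields, and the real design may differ entirely.

// Illustrative only: a layout a structured row could carry to reference
// a blob packed into a shared bin file. None of these names or fields
// come from the PIP.
public final class BlobRef {

    public final String binFilePath; // bin file packing many blobs together
    public final long offset;        // byte offset of this blob in the bin file
    public final long length;        // byte length of this blob

    public BlobRef(String binFilePath, long offset, long length) {
        this.binFilePath = binFilePath;
        this.offset = offset;
        this.length = length;
    }
}

Packing many blobs into one bin file would address the small-files
downside of option 2, while an offset/length pair would let a reader
open a ranged stream over just one blob instead of loading the whole
bin file.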