Hi everyone,

I want to start a discussion about blob files (PIP-35 [1]).

Multimodal data storage needs to handle many kinds of content,
including text, images, audio, video, and embedding vectors. Paimon
needs to support ingesting multimodal data into the lake, and to
provide unified storage and efficient management of multimodal data
together with structured data.

Most multimodal files are actually not large, around 1 MB or smaller,
but there are also relatively large ones, such as files of 10 GB or
more, which pose storage challenges for us.

Consider two approaches:

1. Multimodal data is stored directly in columnar files, such as
Parquet or Lance files. The biggest problem with this solution is the
burden it puts on the file format: to avoid OOM on read and write, the
format needs a streaming API so that an entire multimodal value is
never loaded into memory at once. In addition, the other fields stored
alongside the multimodal data may be changed, added, or even deleted
frequently. If every such change requires the multimodal data to be
read and rewritten as well, the cost is very high.

2. Multimodal data is stored as separate files on object storage, and
Parquet references these files through pointers. The downside is that
the multimodal data cannot be managed directly, and this may produce a
large number of small files, which causes a significant amount of file
IO at read time, leading to lower performance and higher costs.
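
To make the streaming requirement in approach 1 concrete, here is a
minimal Python sketch of chunked copying, where memory use is bounded
by the chunk size rather than by the size of the blob. The names
iter_chunks, copy_stream, and CHUNK_SIZE are hypothetical and only for
illustration; they are not part of Paimon, Parquet, or Lance.

```python
import io
from typing import BinaryIO, Iterator

# Illustrative value only; a real system would tune this.
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the stream's content in pieces of at most chunk_size bytes."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            return
        yield chunk


def copy_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy src to dst one chunk at a time; return the number of bytes copied."""
    total = 0
    for chunk in iter_chunks(src, chunk_size):
        dst.write(chunk)
        total += len(chunk)
    return total
```

A file format that only exposes "give me the whole value as a byte
array" cannot offer this guarantee for a 10 GB value, which is the
difficulty described in approach 1.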

We should consider a new approach to satisfy this requirement: a
high-performance architecture designed specifically for workloads that
mix massive numbers of small multimodal files with large ones,
achieving high-throughput writes and low-latency reads, and meeting
the stringent performance requirements of AI, big data, and other
workloads.

A more intuitive solution is to keep multimodal storage independent
of structured storage and manage it separately, and to introduce a bin
file mechanism in which one bin file stores multiple multimodal
values. Parquet still references the multimodal data through pointers.
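
As a rough illustration of the bin file idea, the Python sketch below
appends several blobs to one file and hands back a pointer (path,
offset, length) for each, which is the kind of value a structured row
could store. This is only a sketch under my own assumptions:
BlobPointer, BinFileWriter, read_blob, and demo_roundtrip are
hypothetical names, and the layout is not the format proposed in the
PIP.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobPointer:
    """Hypothetical pointer a structured row could store for one blob."""
    bin_path: str
    offset: int
    length: int


class BinFileWriter:
    """Appends many blobs to a single bin file and returns their pointers."""

    def __init__(self, bin_path: str) -> None:
        self.bin_path = bin_path
        self._file = open(bin_path, "wb")
        self._offset = 0  # position where the next blob will start

    def append(self, blob: bytes) -> BlobPointer:
        pointer = BlobPointer(self.bin_path, self._offset, len(blob))
        self._file.write(blob)
        self._offset += len(blob)
        return pointer

    def close(self) -> None:
        self._file.close()


def read_blob(pointer: BlobPointer) -> bytes:
    """Read exactly one blob by seeking to its offset in the bin file."""
    with open(pointer.bin_path, "rb") as f:
        f.seek(pointer.offset)
        return f.read(pointer.length)


def demo_roundtrip(blobs):
    """Write all blobs into one temporary bin file, then read each back."""
    with tempfile.TemporaryDirectory() as tmp:
        writer = BinFileWriter(os.path.join(tmp, "blobs.bin"))
        pointers = [writer.append(b) for b in blobs]
        writer.close()
        return pointers, [read_blob(p) for p in pointers]
```

With this shape, many small blobs share one file, which avoids the
small-file problem of approach 2, while a large blob can still be read
on its own through its offset and length.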

What do you think?

[1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-35%3A+Introduce+Blob+to+store+multimodal+data

Best,
Jingsong
