Hi everyone, I want to start a discussion about blob files.

Multimodal data storage needs to support multimedia content, including text, images, audio, video, and embedding vectors. Paimon needs to support multimodal data entering the lake, and to provide unified storage and efficient management of multimodal and structured data.

Most multimodal files are actually not large, around 1MB or even smaller, but there are also relatively large ones, such as 10GB+ files, which pose storage challenges for us.

Consider two ways:

1. Store multimodal data directly in columnar files, such as Parquet or Lance files. The biggest problem with this approach is that it puts pressure on the file format. To avoid OOM on read and write, the format needs a streaming API so that an entire multimodal value is never loaded into memory at once. In addition, the other fields that accompany the multimodal data may be changed, added, or even deleted frequently. If each of these changes requires the multimodal data to be read and rewritten as well, the cost is very high.

2. Store multimodal data as separate files on object storage, and have Parquet reference them through pointers. The downside is that the multimodal data cannot be managed directly, and this may produce a large number of small files, which causes a significant amount of file IO during use, leading to lower performance and higher costs.

We should consider new ways to satisfy this requirement: a high-performance architecture designed specifically for mixed workloads of massive numbers of small and large multimodal files, achieving high-throughput writes and low-latency reads, and meeting the stringent performance requirements of AI, big data, and other businesses.

A more intuitive solution is to keep multimodal storage independent of structured storage and manage it separately, introducing a bin file mechanism in which one bin file stores multiple multimodal objects, while Parquet still references the multimodal data through pointers.

What do you think?
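
To make the bin file idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the format proposed in the PIP: the names BlobRef, append_blob, and read_blob are hypothetical, and the "pointer" is simply (bin file path, offset, length). Many blobs are appended to one bin file, the structured row keeps only the pointer, and both writing and reading go chunk by chunk so that a 10GB+ blob never has to sit in memory.

```python
import io
import os
import tempfile
from dataclasses import dataclass
from typing import BinaryIO, Iterator


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical pointer that the structured (e.g. Parquet) row would store."""
    bin_path: str
    offset: int
    length: int


def append_blob(bin_path: str, source: BinaryIO, chunk_size: int = 1 << 20) -> BlobRef:
    """Append one blob to the bin file in chunks and return its pointer."""
    with open(bin_path, "ab") as out:
        # The blob starts at the current end of the bin file.
        offset = out.seek(0, os.SEEK_END)
        length = 0
        while chunk := source.read(chunk_size):
            out.write(chunk)
            length += len(chunk)
    return BlobRef(bin_path, offset, length)


def read_blob(ref: BlobRef, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Stream one blob back chunk by chunk using only its pointer."""
    with open(ref.bin_path, "rb") as f:
        f.seek(ref.offset)
        remaining = ref.length
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                raise EOFError("bin file is shorter than the pointer claims")
            remaining -= len(chunk)
            yield chunk


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blobs-0.bin")
        small = append_blob(path, io.BytesIO(b"small image bytes"))
        large = append_blob(path, io.BytesIO(b"x" * 3_000_000))
        # Two blobs share one file; each is read back through its own pointer.
        assert b"".join(read_blob(small)) == b"small image bytes"
        assert sum(len(c) for c in read_blob(large)) == 3_000_000
```

In this sketch, small blobs share one file, so the small-file problem of approach 2 goes away, and changing the structured columns does not require touching the bin file, which addresses the rewrite cost of approach 1.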

[1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-35%3A+Introduce+Blob+to+store+multimodal+data

Best,
Jingsong