Hi Parquet community, I hope all is well. My name is Rahil Chertara, and I am an engineer working on open table formats. I wanted to ask the community how to best configure Parquet today for vector storage and retrieval.
From what I've seen, most major models generate vector embeddings as arrays of floating-point values with roughly 700-1500 dimensions, taking up about 3KB-6KB per vector:
https://developers.openai.com/api/docs/guides/embeddings
https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings

My questions below are based on that input.

1. Since Parquet does not have a logical VECTOR type, what data type does the community recommend for writing vectors? My assumption is that most users today would use Parquet's LIST of FLOAT, but are there better ways to represent this? Additionally, are there plans to add a VECTOR type to Parquet in the future?

2. Since vectors have high cardinality, encodings such as DICTIONARY or RLE are unlikely to help. Is there a recommended encoding for users to leverage today?

3. Is there a recommendation for tuning row group and page sizes for vectors? For example, is it always safe to set the row group size to one per file and the page size to the size of a single vector embedding record?

4. In general, should users disable statistics on these vector columns?

5. Is there a recommended compression codec for vectors, or should they generally be kept uncompressed? If vector embeddings should be kept uncompressed, then for parquet-java I believe we will need to allow per-column compression: https://github.com/apache/parquet-java/pull/3396

Thanks again for your assistance and help.

Regards,
Rahil Chertara
