Hi Parquet community, hope all is well,

My name is Rahil Chertara, and I am an engineer working on open table
formats. I wanted to ask the community how best to configure Parquet
today for vector storage and retrieval.

From what I've seen, most major models generate vector embeddings as
arrays of floating-point values with roughly 700–1500 dimensions (about
3KB–6KB per vector at 4 bytes per float):
https://developers.openai.com/api/docs/guides/embeddings
https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
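To make the size math concrete, a quick back-of-the-envelope sketch (the dimension counts below are just illustrative examples within that 700–1500 range):

```python
# Approximate on-disk size of a single embedding vector stored as
# 4-byte floats (Parquet FLOAT), for two illustrative dimension counts.
BYTES_PER_FLOAT = 4

for dims in (768, 1536):  # illustrative dimension counts
    size_bytes = dims * BYTES_PER_FLOAT
    print(f"{dims} dims -> {size_bytes} bytes (~{size_bytes / 1024:.0f} KB)")
# 768 dims  -> 3072 bytes (~3 KB)
# 1536 dims -> 6144 bytes (~6 KB)
```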

My questions below are based on the above.

1. Since Parquet does not have a logical VECTOR type, what data type does
the community recommend for writing vectors? My assumption is that most
users today would use Parquet's LIST with FLOAT, but are there better ways
to represent this? Additionally, are there plans to add a VECTOR type to
Parquet in the future?
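One alternative I could imagine (a hypothetical sketch, not an official recommendation) is packing each vector into a single fixed-length blob, i.e. a FIXED_LEN_BYTE_ARRAY(4 * dims) column rather than LIST<FLOAT>, which avoids per-element repetition/definition levels. The byte layout with Python's stdlib:

```python
import struct

# Hypothetical sketch: pack a vector into the fixed-length byte blob a
# FIXED_LEN_BYTE_ARRAY(4 * dims) column would hold, instead of LIST<FLOAT>.
# Little-endian IEEE-754 single-precision, matching Parquet's FLOAT layout.
def pack_vector(vec):
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack_vector(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

vec = [0.1, 0.2, 0.3]
blob = pack_vector(vec)
print(len(blob))  # 12 bytes for a 3-dim vector
```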

2. Since vectors have high cardinality, encodings such as DICTIONARY or
RLE are unlikely to be effective. Is there a recommended encoding for
users to leverage today?
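My intuition on why DICTIONARY is unlikely to help, as a quick sketch (using uniform random values as a stand-in for embedding data): nearly every float element is unique, so a dictionary page would be roughly as large as the PLAIN data itself.

```python
import random
import struct

# Sketch: with high-cardinality float data, almost every 4-byte value is
# distinct, so a dictionary would need ~one entry per value and saves
# nothing over PLAIN.
random.seed(0)
values = [random.random() for _ in range(10_000)]  # stand-in embedding data

plain_size = len(struct.pack(f"<{len(values)}f", *values))
unique = len({struct.pack("<f", v) for v in values})
print(plain_size, unique)  # dictionary size would approach the PLAIN size
```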

3. Is there a recommendation for tuning row group and page size for
vectors? For example, is it always safe to use a single row group per
file and to set the page size to the size of one vector embedding record?
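For context on the row group question, a quick sketch of how many ~6KB vectors fit in a few common row group sizes (128 MB here is just an example default, not a claim about any particular writer):

```python
# Sketch: vectors per row group at ~6 KB per vector (1536 dims x 4 bytes),
# for a few example row group sizes.
vector_bytes = 1536 * 4  # 6144 bytes per vector
for rg_mb in (128, 512, 1024):
    rg_bytes = rg_mb * 1024 * 1024
    print(f"{rg_mb} MB row group -> ~{rg_bytes // vector_bytes} vectors")
```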

4. In general, should users disable statistics on these vector columns?
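My reasoning for asking, as a sketch (assuming embedding elements are spread across their value range, with uniform random data as a stand-in): per-column min/max over such data covers nearly the whole range, so the stats cannot be used to skip row groups.

```python
import random

# Sketch: min/max statistics over embedding-like data. When values span
# the range, the recorded min/max cover nearly everything, so predicate
# pushdown on these stats prunes nothing.
random.seed(0)
elements = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
lo, hi = min(elements), max(elements)
print(lo, hi)  # close to -1.0 and 1.0
```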

5. Is there a recommended compression codec for vectors, or should they
generally be kept uncompressed? If vector embeddings should be kept
uncompressed, then for parquet-java I believe we will need to allow
per-column compression: https://github.com/apache/parquet-java/pull/3396.
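To illustrate why uncompressed may be attractive, a quick sketch using stdlib zlib on random float bytes as a stand-in for embedding data (actual embeddings and codecs like Snappy/ZSTD will differ, but the high-entropy picture is similar):

```python
import random
import struct
import zlib

# Sketch: general-purpose compression on embedding-like float bytes.
# High-entropy float payloads compress poorly, so compressing vector
# columns mostly burns CPU for little size benefit.
random.seed(0)
raw = struct.pack("<10000f", *[random.random() for _ in range(10_000)])
compressed = zlib.compress(raw, 6)
print(len(raw), len(compressed))  # compressed size stays close to raw size
```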

Thanks in advance for your help.

Regards,
Rahil Chertara
