FWIW, I ran across https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/, which has different recommendations specifically for search, but it seems to confirm some of your thoughts.
Cheers,
Micah

On Thursday, February 19, 2026, Micah Kornfield <[email protected]> wrote:

>> 1. Since Parquet does not have a logical VECTOR type, what data type does
>> the community recommend for writing vectors? My assumption is that most
>> users today would try Parquet's LIST with FLOAT, but are there other ways
>> to represent this better? Additionally, are there plans to add a VECTOR
>> type to Parquet in the future?
>
> If the lists are fixed size, and you have the metadata stored someplace
> externally, then using just FLOAT would be better (there is also the
> logical type FLOAT16, which could be useful). There is a stale proposal to
> support FIXED_SIZE_LIST which, if someone has bandwidth, we should maybe
> revisit [1].
>
>> 2. Since vectors have high cardinality, encodings such as DICTIONARY or
>> RLE might not be as useful. Is there a recommended encoding for users to
>> leverage today?
>
> I've heard anecdotally that BYTE_STREAM_SPLIT + compression can work
> pretty well with embedding data, but I don't have first-hand experience. I
> would expect DICTIONARY/RLE to fall back to PLAIN pretty quickly for this
> type of data. There is also, I think, a proposed ALP encoding for handling
> more scientific-style data; Prateek might be considering adding it as a
> follow-up (it's at least been mentioned).
>
>> 3. Is there a recommendation for tuning row group and page size for
>> vectors? For example, is it always safe to set the row group size to one
>> per file and the page size to the size of one vector embedding record?
>
> I don't have anything concrete here, but one page per vector feels small
> to me. I'd imagine you would want to pack at least O(100 KB), if not more,
> into a page.
>
>> 4. In general, should users disable stats on these vector columns?
>
> Yes, I don't think stats are particularly useful here.
>
>> 5. Is there a recommended compression codec for vectors, or should they
>> generally be kept as uncompressed?
>> If vector embeddings should be kept uncompressed, then for parquet-java I
>> believe we will need to allow per-column compression:
>> https://github.com/apache/parquet-java/pull/3396
>
> As mentioned above, I'd first try BYTE_STREAM_SPLIT + compression. I
> think being able to turn compression on/off per column is likely useful
> anyway, given the other lightweight encodings we've been exploring.
> Thanks for the contribution. I will try to do a first-pass review, but it
> would be great if someone more familiar with the Java implementation could
> help.
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/parquet-format/pull/241
>
> On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> wrote:
>
>> Hi Parquet community, hope all is well.
>>
>> My name is Rahil Chertara, and I am an engineer working on open table
>> formats. I wanted to ask the community how to better configure Parquet
>> for vector storage and retrieval.
>>
>> Typically, from what I've seen, most major models generate vector
>> embeddings as an array of floating-point values, with dimensions around
>> 700-1500 elements (taking up about 3 KB-6 KB per vector):
>> https://developers.openai.com/api/docs/guides/embeddings
>> https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
>>
>> So my questions will be based on the above input.
>>
>> 1. Since Parquet does not have a logical VECTOR type, what data type does
>> the community recommend for writing vectors? My assumption is that most
>> users today would try Parquet's LIST with FLOAT, but are there other ways
>> to represent this better? Additionally, are there plans to add a VECTOR
>> type to Parquet in the future?
>>
>> 2. Since vectors have high cardinality, encodings such as DICTIONARY or
>> RLE might not be as useful. Is there a recommended encoding for users to
>> leverage today?
>>
>> 3. Is there a recommendation for tuning row group and page size for
>> vectors?
>> For example, is it always safe to set the row group size to one per file
>> and the page size to the size of one vector embedding record?
>>
>> 4. In general, should users disable stats on these vector columns?
>>
>> 5. Is there a recommended compression codec for vectors, or should they
>> generally be kept as uncompressed? If vector embeddings should be kept
>> uncompressed, then for parquet-java I believe we will need to allow
>> per-column compression: https://github.com/apache/parquet-java/pull/3396
>>
>> Thanks again for your assistance and help.
>>
>> Regards,
>> Rahil Chertara
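
P.S. For anyone unfamiliar with the BYTE_STREAM_SPLIT suggestion in the thread, the transform it performs is easy to sketch: it scatters the i-th byte of every float32 into its own stream, so the often near-constant sign/exponent bytes of similar-magnitude embedding values end up adjacent and tend to compress better under a general-purpose codec. A minimal, illustrative Python sketch follows (Parquet writers implement this natively; the synthetic "embedding" data here is an assumption for demonstration only):

```python
import math
import struct
import zlib

def byte_stream_split(values):
    """Transpose a float32 sequence byte-wise: all byte-0s first, then all
    byte-1s, etc. -- the core idea behind Parquet's BYTE_STREAM_SPLIT."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    # Group the b-th byte of every 4-byte float together.
    return bytes(raw[i] for b in range(4) for i in range(b, len(raw), 4))

def byte_stream_join(split, n):
    """Inverse transform: reassemble n float32 values from split streams."""
    raw = bytes(split[b * n + i] for i in range(n) for b in range(4))
    return [struct.unpack_from("<f", raw, 4 * i)[0] for i in range(n)]

# Synthetic "embedding": similar-magnitude values, quantized to float32 so
# the round trip is bit-exact.
vec = [struct.unpack("<f", struct.pack("<f", math.sin(i) * 0.25))[0]
       for i in range(1536)]

split = byte_stream_split(vec)
assert byte_stream_join(split, len(vec)) == vec  # lossless reshuffle

plain = b"".join(struct.pack("<f", v) for v in vec)
# Compare how a general-purpose codec does on plain vs. split layouts;
# the exact sizes depend on the data.
print(len(zlib.compress(plain)), len(zlib.compress(split)))
```

This only reorders bytes (the encoded size before compression is unchanged); the win, when there is one, comes entirely from the compression step that follows.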

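To put rough numbers on the page-sizing discussion above, a back-of-envelope sketch (the 1536-dimension figure and the 100 KB target are illustrative assumptions drawn from the ranges mentioned in the thread, not recommendations):

```python
# Back-of-envelope page sizing, assuming 1536-dimension float32 embeddings
# (the thread mentions 700-1500 dims at roughly 3 KB-6 KB per vector).
DIM = 1536
BYTES_PER_VECTOR = DIM * 4            # float32 is 4 bytes -> 6144 B/vector

# One vector per page would mean ~6 KB pages; packing O(100 KB) per page,
# as suggested above, batches multiple vectors per page instead.
TARGET_PAGE_BYTES = 100 * 1024
vectors_per_page = TARGET_PAGE_BYTES // BYTES_PER_VECTOR
print(BYTES_PER_VECTOR, vectors_per_page)  # 6144 16
```

So at these assumed dimensions, a 100 KB page holds on the order of 16 vectors rather than one, which is the gap the "one page per vector feels small" comment is pointing at.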