> 1. Since Parquet does not have a logical VECTOR type, what data type does
> the community recommend for writing vectors? My assumption is that most
> users today would try parquet's LIST with FLOAT but are there other ways to
> represent this better? Additionally, are there plans to add a VECTOR type
> to Parquet in the future?
If the lists are fixed size, and you have the metadata stored someplace
externally, then using just FLOAT would be better (there is also the logical
type Float16, which could be useful). There is a stale proposal to support
FIXED_SIZE_LIST which, if someone has bandwidth, we should maybe revisit [1].

> 2. Since vectors have high cardinality, encodings such as DICTIONARY or RLE
> might not be as useful. Is there a recommended encoding for users to
> leverage today?

I've heard anecdotally that BYTE_STREAM_SPLIT + compression can work pretty
well with embedding data, but I don't have firsthand experience. I would
expect DICTIONARY/RLE to fall back pretty quickly to PLAIN for this type of
data. I think ALP also has a proposed encoding for handling more
scientific-like data; Prateek might be considering adding it as a follow-up
(it's at least been mentioned).

> 3. Is there a recommendation for tuning row group and page size for
> vectors? For example is it always safe to set the row group size to one per
> file and the page size to the size of one vector embedding record?

I don't have anything concrete here, but one page per vector feels small to
me. I'd imagine you would want to pack at least O(100 KB), if not more, into
a page.

> 4. In general should users disable stats on these vector columns?

Yes, I don't think stats are particularly useful here.

> 5. Is there a recommended compression codec for vectors or should they
> generally be kept as uncompressed? If vector embeddings should be kept
> uncompressed, then for parquet-java I believe we will need to allow per
> column compression https://github.com/apache/parquet-java/pull/3396.

As mentioned above, I'd first try BYTE_STREAM_SPLIT + compression. I think
being able to turn compression on/off per column is likely useful anyway,
given the other lightweight encodings we've been exploring.

Thanks for the contribution.
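For context, here is a minimal standalone sketch of what BYTE_STREAM_SPLIT does (illustrative pure Python, not parquet-java's actual implementation): the i-th byte of every float32 value is regrouped into its own stream, so the repetitive sign/exponent bytes of embedding-like data end up adjacent, which often makes a general-purpose compressor more effective.

```python
import struct
import zlib

def byte_stream_split(floats):
    """Serialize values as little-endian float32, then regroup bytes by position 0..3."""
    raw = struct.pack(f"<{len(floats)}f", *floats)
    return b"".join(raw[i::4] for i in range(4))

def byte_stream_unsplit(data, n):
    """Inverse transform: interleave the 4 equal-length streams back into float32s."""
    streams = [data[i * n:(i + 1) * n] for i in range(4)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}f", raw))

# Embedding-like data: values in a narrow range share exponent bytes, so the
# split layout tends to produce long runs of similar bytes.
vec = [0.001 * i for i in range(1024)]
split = byte_stream_split(vec)
plain = struct.pack(f"<{len(vec)}f", *vec)
assert byte_stream_unsplit(split, len(vec)) == list(struct.unpack(f"<{len(vec)}f", plain))
print("split:", len(zlib.compress(split)), "plain:", len(zlib.compress(plain)))
```

The transform itself is lossless and adds no size on its own; any space savings come from the compression step applied afterwards.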
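To put rough numbers on the page-sizing intuition above (hypothetical figures, assuming 1024-dimension float32 embeddings, which falls in the 700-1500 range mentioned in the original question):

```python
# Back-of-envelope page sizing for an embedding column (illustrative only,
# not a parquet-java default or recommendation).
DIM = 1024                     # assumed embedding dimensionality
VECTOR_BYTES = DIM * 4         # float32 values, uncompressed
PAGE_BYTES = 1024 * 1024       # a ~1 MiB page, comfortably above O(100 KB)

vectors_per_page = PAGE_BYTES // VECTOR_BYTES
print(VECTOR_BYTES, "bytes/vector,", vectors_per_page, "vectors/page")
```

So a single ~4 KB vector per page would leave per-page overhead (headers, stats, checksums) amortized over very little data, whereas a page in the hundreds of KB holds a few hundred vectors.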
I will try to do a first pass review but would be great if someone more
familiar with the java implementation could help.

Cheers,
Micah

[1] https://github.com/apache/parquet-format/pull/241

On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> wrote:

> Hi Parquet community hope all is well,
>
> My name is Rahil Chertara, and I am an engineer working on open table
> formats. I wanted to ask the community how to better configure Parquet
> currently for vector storage and retrieval.
>
> Typically from what I've seen, most major models generate vector embeddings
> as an array of floating point values, with dimensions around 700-1500
> elements (taking up about 3KB–6KB per vector)
> https://developers.openai.com/api/docs/guides/embeddings
> https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
>
> So my questions will be based on the above input.
>
> 1. Since Parquet does not have a logical VECTOR type, what data type does
> the community recommend for writing vectors? My assumption is that most
> users today would try parquet's LIST with FLOAT but are there other ways to
> represent this better? Additionally, are there plans to add a VECTOR type
> to Parquet in the future?
>
> 2. Since vectors have high cardinality, encodings such as DICTIONARY or RLE
> might not be as useful. Is there a recommended encoding for users to
> leverage today?
>
> 3. Is there a recommendation for tuning row group and page size for
> vectors? For example is it always safe to set the row group size to one per
> file and the page size to the size of one vector embedding record?
>
> 4. In general should users disable stats on these vector columns?
>
> 5. Is there a recommended compression codec for vectors or should they
> generally be kept as uncompressed? If vector embeddings should be kept
> uncompressed, then for parquet-java I believe we will need to allow per
> column compression https://github.com/apache/parquet-java/pull/3396.
>
> Thanks again for your assistance and help.
>
> Regards,
> Rahil Chertara
