FWIW, I ran across https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/, which has different recommendations specifically for search, but it seems to confirm some of your thoughts.
Cheers,
Micah

On Thursday, February 19, 2026, Micah Kornfield <[email protected]> wrote:

>> 1. Since Parquet does not have a logical VECTOR type, what data type does
>> the community recommend for writing vectors? My assumption is that most
>> users today would try Parquet's LIST with FLOAT, but are there other ways
>> to represent this better? Additionally, are there plans to add a VECTOR
>> type to Parquet in the future?
>
> If the lists are fixed size, and you have the metadata stored someplace
> externally, then using just FLOAT would be better (there is also the
> logical type FLOAT16, which could be useful). There is a stale proposal to
> support FIXED_SIZE_LIST which, if someone has bandwidth, we should maybe
> revisit [1].
>
>> 2. Since vectors have high cardinality, encodings such as DICTIONARY or
>> RLE might not be as useful. Is there a recommended encoding for users to
>> leverage today?
>
> I've heard anecdotally that BYTE_STREAM_SPLIT + compression can work
> pretty well with embedding data, but I don't have first-hand experience. I
> would expect DICTIONARY/RLE to fall back to PLAIN pretty quickly for this
> type of data. There is also, I think, a proposed ALP encoding for handling
> more scientific-style data; Prateek might be considering adding it as a
> follow-up (it's at least been mentioned).
>
>> 3. Is there a recommendation for tuning row group and page size for
>> vectors? For example, is it always safe to set the row group size to one
>> per file and the page size to the size of one vector embedding record?
>
> I don't have anything concrete here, but one page per vector feels small
> to me. I'd imagine you would want to pack at least O(100 KB), if not more,
> into a page.
>
>> 4. In general, should users disable stats on these vector columns?
>
> Yes, I don't think stats are particularly useful here.
>
>> 5. Is there a recommended compression codec for vectors, or should they
>> generally be kept as uncompressed?
>> If vector embeddings should be kept uncompressed, then for parquet-java I
>> believe we will need to allow per-column compression:
>> https://github.com/apache/parquet-java/pull/3396
>
> As mentioned above, I'd first try BYTE_STREAM_SPLIT + compression. I
> think being able to turn compression on/off per column is likely useful
> anyway, given the other lightweight encodings we've been exploring.
> Thanks for the contribution. I will try to do a first-pass review, but it
> would be great if someone more familiar with the Java implementation could
> help.
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/parquet-format/pull/241
>
> On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> wrote:
>
>> Hi Parquet community, hope all is well.
>>
>> My name is Rahil Chertara, and I am an engineer working on open table
>> formats. I wanted to ask the community how to better configure Parquet
>> for vector storage and retrieval.
>>
>> Typically, from what I've seen, most major models generate vector
>> embeddings as an array of floating-point values, with dimensions around
>> 700-1500 elements (taking up about 3 KB-6 KB per vector):
>> https://developers.openai.com/api/docs/guides/embeddings
>> https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
>>
>> So my questions will be based on the above input.
>>
>> 1. Since Parquet does not have a logical VECTOR type, what data type does
>> the community recommend for writing vectors? My assumption is that most
>> users today would try Parquet's LIST with FLOAT, but are there other ways
>> to represent this better? Additionally, are there plans to add a VECTOR
>> type to Parquet in the future?
>>
>> 2. Since vectors have high cardinality, encodings such as DICTIONARY or
>> RLE might not be as useful. Is there a recommended encoding for users to
>> leverage today?
>>
>> 3. Is there a recommendation for tuning row group and page size for
>> vectors?
>> For example, is it always safe to set the row group size to one per file
>> and the page size to the size of one vector embedding record?
>>
>> 4. In general, should users disable stats on these vector columns?
>>
>> 5. Is there a recommended compression codec for vectors, or should they
>> generally be kept as uncompressed? If vector embeddings should be kept
>> uncompressed, then for parquet-java I believe we will need to allow
>> per-column compression: https://github.com/apache/parquet-java/pull/3396
>>
>> Thanks again for your assistance and help.
>>
>> Regards,
>> Rahil Chertara
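
P.S. For anyone unfamiliar with the BYTE_STREAM_SPLIT suggestion in the thread, the transform it performs is easy to sketch: it scatters the i-th byte of every float32 into its own stream, so the often near-constant sign/exponent bytes of similar-magnitude embedding values end up adjacent and tend to compress better under a general-purpose codec. A minimal, illustrative Python sketch follows (Parquet writers implement this natively; the synthetic "embedding" data here is an assumption for demonstration only):

```python
import math
import struct
import zlib

def byte_stream_split(values):
    """Transpose a float32 sequence byte-wise: all byte-0s first, then all
    byte-1s, etc. -- the core idea behind Parquet's BYTE_STREAM_SPLIT."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    # Group the b-th byte of every 4-byte float together.
    return bytes(raw[i] for b in range(4) for i in range(b, len(raw), 4))

def byte_stream_join(split, n):
    """Inverse transform: reassemble n float32 values from split streams."""
    raw = bytes(split[b * n + i] for i in range(n) for b in range(4))
    return [struct.unpack_from("<f", raw, 4 * i)[0] for i in range(n)]

# Synthetic "embedding": similar-magnitude values, quantized to float32 so
# the round trip is bit-exact.
vec = [struct.unpack("<f", struct.pack("<f", math.sin(i) * 0.25))[0]
       for i in range(1536)]

split = byte_stream_split(vec)
assert byte_stream_join(split, len(vec)) == vec  # lossless reshuffle

plain = b"".join(struct.pack("<f", v) for v in vec)
# Compare how a general-purpose codec does on plain vs. split layouts;
# the exact sizes depend on the data.
print(len(zlib.compress(plain)), len(zlib.compress(split)))
```

This only reorders bytes (the encoded size before compression is unchanged); the win, when there is one, comes entirely from the compression step that follows.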

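To put rough numbers on the page-sizing discussion above, a back-of-envelope sketch (the 1536-dimension figure and the 100 KB target are illustrative assumptions drawn from the ranges mentioned in the thread, not recommendations):

```python
# Back-of-envelope page sizing, assuming 1536-dimension float32 embeddings
# (the thread mentions 700-1500 dims at roughly 3 KB-6 KB per vector).
DIM = 1536
BYTES_PER_VECTOR = DIM * 4            # float32 is 4 bytes -> 6144 B/vector

# One vector per page would mean ~6 KB pages; packing O(100 KB) per page,
# as suggested above, batches multiple vectors per page instead.
TARGET_PAGE_BYTES = 100 * 1024
vectors_per_page = TARGET_PAGE_BYTES // BYTES_PER_VECTOR
print(BYTES_PER_VECTOR, vectors_per_page)  # 6144 16
```

So at these assumed dimensions, a 100 KB page holds on the order of 16 vectors rather than one, which is the gap the "one page per vector feels small" comment is pointing at.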