> 1. Since Parquet does not have a logical VECTOR type, what data type does
> the community recommend for writing vectors? My assumption is that most
> users today would try parquet's LIST with FLOAT but are there other ways to
> represent this better? Additionally, are there plans to add a VECTOR type
> to Parquet in the future?
If the lists are fixed size, and you have the metadata stored someplace
externally, then using just FLOAT would be better (there is also the logical
type Float16, which could be useful). There is a stale proposal to support
FIXED_SIZE_LIST which, if someone has bandwidth, we should maybe revisit [1].

> 2. Since vectors have high cardinality, encodings such as DICTIONARY or RLE
> might not be as useful. Is there a recommended encoding for users to
> leverage today?

I've heard anecdotally that BYTE_STREAM_SPLIT + compression can work pretty
well with embedding data, but I don't have firsthand experience. I would
expect DICTIONARY/RLE to fall back pretty quickly to PLAIN for this type of
data. I think ALP also has a proposed encoding for handling more
scientific-like data; Prateek might be considering adding it as a follow-up
(it's at least been mentioned).

> 3. Is there a recommendation for tuning row group and page size for
> vectors? For example is it always safe to set the row group size to one per
> file and the page size to the size of one vector embedding record?

I don't have anything concrete here, but one page per vector feels small to
me. I'd imagine you would want to pack at least O(100 KB), if not more, into
a page.

> 4. In general should users disable stats on these vector columns?

Yes, I don't think stats are particularly useful here.

> 5. Is there a recommended compression codec for vectors or should they
> generally be kept as uncompressed? If vector embeddings should be kept
> uncompressed, then for parquet-java I believe we will need to allow per
> column compression https://github.com/apache/parquet-java/pull/3396.

As mentioned above, I'd first try BYTE_STREAM_SPLIT + compression. I think
being able to turn compression on/off per column is likely useful anyway,
given the other lightweight encodings we've been exploring.

Thanks for the contribution.
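For context, here is a minimal standalone sketch of what BYTE_STREAM_SPLIT does (illustrative pure Python, not parquet-java's actual implementation): the i-th byte of every float32 value is regrouped into its own stream, so the repetitive sign/exponent bytes of embedding-like data end up adjacent, which often makes a general-purpose compressor more effective.

```python
import struct
import zlib

def byte_stream_split(floats):
    """Serialize values as little-endian float32, then regroup bytes by position 0..3."""
    raw = struct.pack(f"<{len(floats)}f", *floats)
    return b"".join(raw[i::4] for i in range(4))

def byte_stream_unsplit(data, n):
    """Inverse transform: interleave the 4 equal-length streams back into float32s."""
    streams = [data[i * n:(i + 1) * n] for i in range(4)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}f", raw))

# Embedding-like data: values in a narrow range share exponent bytes, so the
# split layout tends to produce long runs of similar bytes.
vec = [0.001 * i for i in range(1024)]
split = byte_stream_split(vec)
plain = struct.pack(f"<{len(vec)}f", *vec)
assert byte_stream_unsplit(split, len(vec)) == list(struct.unpack(f"<{len(vec)}f", plain))
print("split:", len(zlib.compress(split)), "plain:", len(zlib.compress(plain)))
```

The transform itself is lossless and adds no size on its own; any space savings come from the compression step applied afterwards.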
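To put rough numbers on the page-sizing intuition above (hypothetical figures, assuming 1024-dimension float32 embeddings, which falls in the 700-1500 range mentioned in the original question):

```python
# Back-of-envelope page sizing for an embedding column (illustrative only,
# not a parquet-java default or recommendation).
DIM = 1024                     # assumed embedding dimensionality
VECTOR_BYTES = DIM * 4         # float32 values, uncompressed
PAGE_BYTES = 1024 * 1024       # a ~1 MiB page, comfortably above O(100 KB)

vectors_per_page = PAGE_BYTES // VECTOR_BYTES
print(VECTOR_BYTES, "bytes/vector,", vectors_per_page, "vectors/page")
```

So a single ~4 KB vector per page would leave per-page overhead (headers, stats, checksums) amortized over very little data, whereas a page in the hundreds of KB holds a few hundred vectors.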
I will try to do a first pass review but would be great if someone more
familiar with the java implementation could help.

Cheers,
Micah

[1] https://github.com/apache/parquet-format/pull/241

On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> wrote:

> Hi Parquet community hope all is well,
>
> My name is Rahil Chertara, and I am an engineer working on open table
> formats. I wanted to ask the community how to better configure Parquet
> currently for vector storage and retrieval.
>
> Typically from what I've seen, most major models generate vector embeddings
> as an array of floating point values, with dimensions around 700-1500
> elements (taking up about 3KB–6KB per vector)
> https://developers.openai.com/api/docs/guides/embeddings
> https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
>
> So my questions will be based on the above input.
>
> 1. Since Parquet does not have a logical VECTOR type, what data type does
> the community recommend for writing vectors? My assumption is that most
> users today would try parquet's LIST with FLOAT but are there other ways to
> represent this better? Additionally, are there plans to add a VECTOR type
> to Parquet in the future?
>
> 2. Since vectors have high cardinality, encodings such as DICTIONARY or RLE
> might not be as useful. Is there a recommended encoding for users to
> leverage today?
>
> 3. Is there a recommendation for tuning row group and page size for
> vectors? For example is it always safe to set the row group size to one per
> file and the page size to the size of one vector embedding record?
>
> 4. In general should users disable stats on these vector columns?
>
> 5. Is there a recommended compression codec for vectors or should they
> generally be kept as uncompressed? If vector embeddings should be kept
> uncompressed, then for parquet-java I believe we will need to allow per
> column compression https://github.com/apache/parquet-java/pull/3396.
>
> Thanks again for your assistance and help.
>
> Regards,
> Rahil Chertara
