Thanks, Micah, for the helpful pointers and the initial pass on the column compression PR <https://github.com/apache/parquet-java/pull/3396>; I greatly appreciate it. What you mentioned aligns with some micro-benchmarks I have been running that compare Parquet against the Lance file format for writing and reading vectors.
The experiment involved writing 10,000 vectors (each with 1,536 dimensions, where elements are 4-byte FLOATs, resulting in about 6KB per record) using each file format's Java API:

* We performed a full round trip: writing all vectors to the file and then reading them back.
* For Parquet we tried several combinations: different physical type backings (LIST<FLOAT> and FIXED_LEN_BYTE_ARRAY), the relevant encodings (PLAIN, BYTE_STREAM_SPLIT) mentioned in Prateek's ALP doc [1], and different compressions (SNAPPY, ZSTD, UNCOMPRESSED). We also disabled dictionary encoding and statistics on the vector embedding column. Finally, we tuned the row group size to match the file size (effectively one row group per file) and the page size to the size of one vector embedding, as mentioned in Julien's blog [2] and the blog you shared above from Xiangpeng [3].
* For Lance we used vanilla settings, based on its claims of already handling vectors optimally. Under the hood, my understanding is that Lance uses Apache Arrow's FixedSizeList for vectors.
* We performed 5 warmup rounds and 10 measurement rounds and collected the averages.
* The experiment was conducted on a local machine's file system as a quick test to get initial signals.

An initial summary of the results:

* Parquet LIST (BYTE_STREAM_SPLIT, ZSTD) had the most compact file size, but the difference compared to other combinations was minimal.
* Parquet FIXED outperformed Parquet LIST by a wide margin in all combinations.
* Lance was the fastest overall on writes, but not by a large margin compared to Parquet FIXED.
* Parquet FIXED (BYTE_STREAM_SPLIT, UNCOMPRESSED) was the fastest on reads across all combinations.
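To make the FIXED_LEN_BYTE_ARRAY sizing concrete, here is a small illustrative sketch (plain JDK, not the actual benchmark code; the class and method names are made up) of packing a float32 vector into the D * 4 bytes a FIXED_LEN_BYTE_ARRAY(D * 4) column would hold, assuming little-endian byte order to match Parquet's PLAIN float layout:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VectorPacking {
    // Pack a float32 vector into the fixed-size byte layout a
    // FIXED_LEN_BYTE_ARRAY(D * 4) column would carry per record.
    static byte[] pack(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // Inverse: recover the float32 vector from its fixed-size bytes.
    static float[] unpack(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buf.getFloat();
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] v = new float[1536]; // an embedding-sized vector
        for (int i = 0; i < v.length; i++) v[i] = i * 0.5f;
        byte[] packed = pack(v);
        System.out.println(packed.length); // 6144 bytes = 1536 * 4
    }
}
```

A 1,536-dimension float32 vector packs to exactly 6,144 bytes, which is why knowing D and the element type up front is all the reader needs.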
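For anyone curious why BYTE_STREAM_SPLIT pairs well with embedding data, here is a rough stand-alone sketch of the idea (my own toy transposition, not the parquet-java implementation): byte k of every float goes into stream k, so the sign/exponent bytes of similar-magnitude floats end up adjacent and compress better.

```java
public class ByteStreamSplitSketch {
    // Scatter byte k (little-endian) of each float into stream k.
    // Stream k occupies out[k * n .. (k + 1) * n) for n values.
    static byte[] split(float[] values) {
        int n = values.length;
        byte[] out = new byte[n * Float.BYTES];
        for (int i = 0; i < n; i++) {
            int bits = Float.floatToIntBits(values[i]);
            for (int k = 0; k < Float.BYTES; k++) {
                out[k * n + i] = (byte) (bits >>> (8 * k));
            }
        }
        return out;
    }

    // Inverse transform: gather byte k of value i from stream k.
    static float[] join(byte[] streams, int n) {
        float[] values = new float[n];
        for (int i = 0; i < n; i++) {
            int bits = 0;
            for (int k = 0; k < Float.BYTES; k++) {
                bits |= (streams[k * n + i] & 0xFF) << (8 * k);
            }
            values[i] = Float.intBitsToFloat(bits);
        }
        return values;
    }
}
```

The transform itself saves nothing; the win only shows up after a general-purpose codec like ZSTD runs over the now more homogeneous streams, which matches the LIST (BYTE_STREAM_SPLIT, ZSTD) file-size result above.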
I have attached a gist here for others to view the full results:
https://gist.github.com/rahil-c/066f689f91cdb91204a3fb4a9f2aefac

Regarding the original FIXED_SIZE_LIST logical type PR
<https://github.com/apache/parquet-format/pull/241/changes#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR292>,
backing it with Parquet's primitive FIXED_LEN_BYTE_ARRAY makes sense to me. As
long as you know the vector's dimension (D) and the element type (such as
float32), you can allocate D * 4 bytes per vector. I am curious whether anyone
in the community plans to revisit this, or whether it is open for volunteers?

Links:
1. https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0#heading=h.5xf60mx6q7xk
2. https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
3. https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion

Regards,
Rahil Chertara

On Sat, Feb 21, 2026 at 11:05 AM Micah Kornfield <[email protected]> wrote:

> FWIW, I ran across
> https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/
> which has different recommendations specifically for search but it seems to
> confirm some of your thoughts.
>
> Cheers,
> Micah
>
> On Thursday, February 19, 2026, Micah Kornfield <[email protected]> wrote:
>
> >> 1. Since Parquet does not have a logical VECTOR type, what data type does
> >> the community recommend for writing vectors? My assumption is that most
> >> users today would try parquet's LIST with FLOAT but are there other ways to
> >> represent this better? Additionally, are there plans to add a VECTOR type
> >> to Parquet in the future?
> >
> > If the lists are fixed size, and you have the metadata stored someplace
> > externally, then using just FLOAT would be better (there is also the
> > logical type float16) which could be useful. There is a stale proposal to
> > support FIXED_SIZE_LIST which if someone has bandwidth we should maybe
> > revisit [1]
> >
> >> 2.
> >> Since vectors have high cardinality, encodings such as DICTIONARY or RLE
> >> might not be as useful. Is there a recommended encoding for users to
> >> leverage today?
> >
> > I've heard anecdotally, the BYTE_STREAM_SPLIT + compression can work
> > pretty well with embedding data, but don't have first hand experience. I
> > would expect DICTIONARY/RLE would fall back pretty quickly to plain for
> > this type of data. ALP I think also has a proposed encoding for handling
> > more scientific like data. I think Prateek might be considering adding as
> > a follow-up (its at least been mentioned).
> >
> >> 3. Is there a recommendation for tuning row group and page size for
> >> vectors? For example is it always safe to set the row group size to one per
> >> file and the page size to the size of one vector embedding record?
> >
> > I don't have anything concrete here, but 1 page per vector feels small to
> > me. I'd imagine you would at least want to pack O(100 KB) if not more into
> > a page.
> >
> >> 4. In general should users disable stats on these vector columns?
> >
> > Yes, I don't think stats are particularly useful here.
> >
> >> 5. Is there a recommended compression codec for vectors or should they
> >> generally be kept as uncompressed? If vector embeddings should be kept
> >> uncompressed, then for parquet-java I believe we will need to allow per
> >> column compression https://github.com/apache/parquet-java/pull/3396.
> >
> > As mentioned above. I'd first try byte_stream_split + compression. I
> > think being able to turn compression on/off per column is likely useful
> > anyways given the other light-weight encodings we've been exploring.
> > Thanks for the contribution. I will try to do a first pass review but would
> > be great if someone more familiar with the java implementation could help.
> >
> > Cheers,
> > Micah
> >
> > [1] https://github.com/apache/parquet-format/pull/241
> >
> > On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> wrote:
> >
> >> Hi Parquet community hope all is well,
> >>
> >> My name is Rahil Chertara, and I am an engineer working on open table
> >> formats. I wanted to ask the community how to better configure Parquet
> >> currently for vector storage and retrieval.
> >>
> >> Typically from what I've seen, most major models generate vector embeddings
> >> as an array of floating point values, with dimensions around 700-1500
> >> elements (taking up about 3KB–6KB per vector)
> >> https://developers.openai.com/api/docs/guides/embeddings
> >> https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
> >>
> >> So my questions will be based on the above input.
> >>
> >> 1. Since Parquet does not have a logical VECTOR type, what data type does
> >> the community recommend for writing vectors? My assumption is that most
> >> users today would try parquet's LIST with FLOAT but are there other ways to
> >> represent this better? Additionally, are there plans to add a VECTOR type
> >> to Parquet in the future?
> >>
> >> 2. Since vectors have high cardinality, encodings such as DICTIONARY or RLE
> >> might not be as useful. Is there a recommended encoding for users to
> >> leverage today?
> >>
> >> 3. Is there a recommendation for tuning row group and page size for
> >> vectors? For example is it always safe to set the row group size to one per
> >> file and the page size to the size of one vector embedding record?
> >>
> >> 4. In general should users disable stats on these vector columns?
> >>
> >> 5. Is there a recommended compression codec for vectors or should they
> >> generally be kept as uncompressed? If vector embeddings should be kept
> >> uncompressed, then for parquet-java I believe we will need to allow per
> >> column compression https://github.com/apache/parquet-java/pull/3396.
> >>
> >> Thanks again for your assistance and help.
> >>
> >> Regards,
> >> Rahil Chertara
