I agree that adding a FIXED_SIZE_LIST type to remove the overhead of
storing (the same) list length for every element sounds like a clear
improvement, and it has a natural mapping to Arrow's FixedSizeList type.
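As a rough illustration of the saved overhead (a sketch with assumed numbers,
not Parquet's exact on-disk layout): once the dimension is stored once in the
type, a fixed-size float32 vector's payload is fully determined.

```python
import struct

# Sketch: a vector of D float32 elements needs exactly D * 4 payload bytes
# once D is known from the type, so no per-row length has to be stored.
dim = 1536
vector = [0.5] * dim

payload = struct.pack(f"<{dim}f", *vector)   # fixed-size encoding
print(len(payload))                          # 6144 bytes, ~6 KB per record

# Round trip: decode the same D * 4 bytes back into the vector.
decoded = list(struct.unpack(f"<{dim}f", payload))
```

With a variable-length LIST, by contrast, per-element repetition/definition
metadata scales with D instead of being stated once in the schema.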

On Tue, Mar 3, 2026 at 11:11 AM Rok Mihevc <[email protected]> wrote:

> Thanks for working and reporting on this, Rahil! I've updated the
> FIXED_SIZE_LIST PR with received feedback and reported to the original ML
> thread [1]. It would be great to bring more scientific data support
> into Parquet!
> More feedback from the community would be most welcome!
>
> Best,
> Rok
>
> [1] https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3
>
> On Mon, Mar 2, 2026 at 11:07 PM Rahil C <[email protected]> wrote:
>
> > Raising this thread again to see if others in the community have thoughts
> > on this topic and to discuss potential next steps.
> >
> > I believe Rok had an original proposal here
> > https://github.com/apache/parquet-format/pull/241 for adding a
> > FIXED_SIZE_LIST logical type, which users can leverage in the future for
> > writing vector embeddings.
> > The initial proposed spec makes sense to me so I am wondering if the
> > community needs help with any other items outside the spec PR? If so, I do
> > not mind assisting with items in parquet-java.
> >
> > Regards,
> > Rahil Chertara
> >
> > On Mon, Feb 23, 2026 at 5:56 PM Rahil C <[email protected]> wrote:
> >
> > > Thanks Micah for the helpful pointers and the initial pass on the column
> > > compression PR <https://github.com/apache/parquet-java/pull/3396>, I
> > > appreciate it greatly. What you mentioned aligns with some
> > > micro-benchmarks I've been running that compare Parquet against the Lance
> > > file format for writing and reading vectors.
> > >
> > > The experiment involved writing 10,000 vectors (each with 1,536
> > > dimensions, where elements are 4-byte FLOATs, resulting in about 6KB per
> > > record) and using the respective file format's Java APIs:
> > > * We performed a full round trip: writing all vectors to the file and
> > > then reading them back.
> > > * For Parquet we tried several combinations of physical type backings
> > > (LIST<FLOAT> and FIXED_LEN_BYTE_ARRAY), the relevant encodings (PLAIN,
> > > BYTE_STREAM_SPLIT) mentioned in Prateek's ALP doc [1], and different
> > > compressions (SNAPPY, ZSTD, UNCOMPRESSED). We also disabled dictionary
> > > encoding and statistics on the vector embedding column. Finally, we tuned
> > > the row group size to match the file size (effectively one row group) and
> > > the page size to the size of one vector embedding, as mentioned in
> > > Julien's blog [2] and the blog you shared above from Xiangpeng [3].
> > > * For Lance we opted for vanilla settings based on its claims of already
> > > handling vectors optimally. Under the hood, my understanding is that
> > > Lance uses Apache Arrow's FixedSizeList for vectors.
> > > * We performed 5 warmup rounds and 10 measurement rounds and collected
> > > the averages below.
> > > * The experiment was conducted on a local machine's file system as a
> > > quick test to get initial signals.
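The warmup/measurement protocol above can be sketched as follows (a
hypothetical harness, not the code behind the gist numbers; `fn` stands in for
whichever format's round trip is being measured):

```python
import statistics
import time

def bench(fn, warmup=5, rounds=10):
    # Discard warmup runs, then average the measured rounds,
    # mirroring the 5 warmup / 10 measurement protocol above.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# Usage with a trivial stand-in workload:
calls = []
avg_seconds = bench(lambda: calls.append(1))
```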
> > >
> > > An initial summary of the results:
> > > * Parquet LIST (BYTE_STREAM_SPLIT, ZSTD) had the most compact file size,
> > > but the difference compared to other combinations was minimal.
> > > * Parquet FIXED outperformed Parquet LIST by a wide margin in all
> > > combinations.
> > > * Lance was the fastest overall on writes, but not by a large margin
> > > compared to Parquet FIXED.
> > > * Parquet FIXED (BYTE_STREAM_SPLIT, UNCOMPRESSED) was fastest on reads
> > > across all combinations.
> > > I have attached a gist here for others to view the full results:
> > > https://gist.github.com/rahil-c/066f689f91cdb91204a3fb4a9f2aefac
> > >
> > > Regarding the original FIXED_SIZE_LIST logical type PR
> > > <https://github.com/apache/parquet-format/pull/241/changes#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR292>,
> > > backing it with Parquet's primitive FIXED_LEN_BYTE_ARRAY makes sense to
> > > me. As long as you know the vector's dimension (D) and the element type
> > > (such as float32), you can allocate (D * 4) bytes.
> > > I am curious if anyone in the community plans to revisit this or if this
> > > is open for volunteers?
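A minimal sketch of the (D * element size) sizing rule above (`flba_width` is
a hypothetical helper, not part of any Parquet API); Python's `struct` module
also covers the half-precision case that would correspond to the Float16
logical type:

```python
import struct

def flba_width(dim, fmt):
    # Fixed-length byte-array width for a vector column: D * bytes per element.
    # fmt: "f" = float32 (4 bytes), "e" = float16 (2 bytes)
    return dim * struct.calcsize(fmt)

dim = 1536
print(flba_width(dim, "f"))  # 6144 bytes for float32
print(flba_width(dim, "e"))  # 3072 bytes for float16
```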
> > >
> > > Links:
> > > 1. https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0#heading=h.5xf60mx6q7xk
> > > 2. https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
> > > 3. https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion
> > >
> > > Regards,
> > > Rahil Chertara
> > >
> > > On Sat, Feb 21, 2026 at 11:05 AM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > >> FWIW, I ran across
> > >> https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/
> > >> which has different recommendations specifically for search, but it
> > >> seems to confirm some of your thoughts.
> > >>
> > >> Cheers,
> > >> Micah
> > >>
> > >> On Thursday, February 19, 2026, Micah Kornfield <[email protected]>
> > >> wrote:
> > >>
> > >> >> 1. Since Parquet does not have a logical VECTOR type, what data type
> > >> >> does the community recommend for writing vectors? My assumption is
> > >> >> that most users today would try Parquet's LIST with FLOAT but are
> > >> >> there other ways to represent this better? Additionally, are there
> > >> >> plans to add a VECTOR type to Parquet in the future?
> > >> >
> > >> >
> > >> > If the lists are fixed size, and you have the metadata stored
> > >> > someplace externally, then using just FLOAT would be better (there is
> > >> > also the logical type Float16, which could be useful). There is a
> > >> > stale proposal to support FIXED_SIZE_LIST which, if someone has
> > >> > bandwidth, we should maybe revisit [1].
> > >> >
> > >> >> 2. Since vectors have high cardinality, encodings such as DICTIONARY
> > >> >> or RLE might not be as useful. Is there a recommended encoding for
> > >> >> users to leverage today?
> > >> >
> > >> > I've heard anecdotally that BYTE_STREAM_SPLIT + compression can work
> > >> > pretty well with embedding data, but I don't have first-hand
> > >> > experience. I would expect DICTIONARY/RLE to fall back pretty quickly
> > >> > to PLAIN for this type of data. ALP I think also has a proposed
> > >> > encoding for handling more scientific-like data. I think Prateek might
> > >> > be considering adding it as a follow-up (it's at least been mentioned).
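The BYTE_STREAM_SPLIT idea mentioned above can be sketched in a few lines (a
toy model, not Parquet's implementation): scatter the k-th byte of every
float32 into its own stream, so the similar sign/exponent bytes sit next to
each other and compress better under a general-purpose codec.

```python
import struct

def split_streams(values):
    # Encode float32 values, then group byte 0 of every value, byte 1, etc.
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[k::4] for k in range(4))

def join_streams(data, n):
    # Inverse: re-interleave the four byte streams back into float32 values.
    out = bytearray(n * 4)
    for k in range(4):
        out[k::4] = data[k * n:(k + 1) * n]
    return list(struct.unpack(f"<{n}f", bytes(out)))

vectors = [0.25, 0.5, 0.75, 1.0]
assert join_streams(split_streams(vectors), len(vectors)) == vectors
```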
> > >> >
> > >> >> 3. Is there a recommendation for tuning row group and page size for
> > >> >> vectors? For example is it always safe to set the row group size to
> > >> >> one per file and the page size to the size of one vector embedding
> > >> >> record?
> > >> >
> > >> >
> > >> > I don't have anything concrete here, but 1 page per vector feels
> > >> > small to me. I'd imagine you would at least want to pack O(100 KB) if
> > >> > not more into a page.
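For concreteness, the O(100 KB) suggestion works out to roughly the following
(assumed numbers, based on the ~6 KB float32 embeddings discussed in this
thread):

```python
# Back-of-envelope: how many ~6 KB float32 embeddings fit in a ~100 KB page.
vector_bytes = 1536 * 4        # one float32 embedding, 6144 bytes
page_target = 100 * 1024       # O(100 KB) page
vectors_per_page = page_target // vector_bytes
print(vectors_per_page)        # 16 vectors per page instead of 1
```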
> > >> >
> > >> >> 4. In general should users disable stats on these vector columns?
> > >> >
> > >> >
> > >> > Yes, I don't think stats are particularly useful here.
> > >> >
> > >> >> 5. Is there a recommended compression codec for vectors or should
> > >> >> they generally be kept as uncompressed? If vector embeddings should
> > >> >> be kept uncompressed, then for parquet-java I believe we will need to
> > >> >> allow per column compression:
> > >> >> https://github.com/apache/parquet-java/pull/3396.
> > >> >
> > >> >
> > >> > As mentioned above, I'd first try BYTE_STREAM_SPLIT + compression. I
> > >> > think being able to turn compression on/off per column is likely
> > >> > useful anyway given the other light-weight encodings we've been
> > >> > exploring. Thanks for the contribution. I will try to do a first-pass
> > >> > review but it would be great if someone more familiar with the Java
> > >> > implementation could help.
> > >> >
> > >> > Cheers,
> > >> > Micah
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > [1] https://github.com/apache/parquet-format/pull/241
> > >> >
> > >> > On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> wrote:
> > >> >
> > >> >> Hi Parquet community hope all is well,
> > >> >>
> > >> >> My name is Rahil Chertara, and I am an engineer working on open
> > >> >> table formats. I wanted to ask the community how to better configure
> > >> >> Parquet today for vector storage and retrieval.
> > >> >>
> > >> >> Typically from what I've seen, most major models generate vector
> > >> >> embeddings as arrays of floating point values, with dimensions around
> > >> >> 700-1500 elements (taking up about 3KB-6KB per vector):
> > >> >> https://developers.openai.com/api/docs/guides/embeddings
> > >> >> https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
> > >> >>
> > >> >> So my questions will be based on the above input.
> > >> >>
> > >> >> 1. Since Parquet does not have a logical VECTOR type, what data type
> > >> >> does the community recommend for writing vectors? My assumption is
> > >> >> that most users today would try Parquet's LIST with FLOAT but are
> > >> >> there other ways to represent this better? Additionally, are there
> > >> >> plans to add a VECTOR type to Parquet in the future?
> > >> >>
> > >> >> 2. Since vectors have high cardinality, encodings such as DICTIONARY
> > >> >> or RLE might not be as useful. Is there a recommended encoding for
> > >> >> users to leverage today?
> > >> >>
> > >> >> 3. Is there a recommendation for tuning row group and page size for
> > >> >> vectors? For example is it always safe to set the row group size to
> > >> >> one per file and the page size to the size of one vector embedding
> > >> >> record?
> > >> >>
> > >> >> 4. In general should users disable stats on these vector columns?
> > >> >>
> > >> >> 5. Is there a recommended compression codec for vectors or should
> > >> >> they generally be kept as uncompressed? If vector embeddings should
> > >> >> be kept uncompressed, then for parquet-java I believe we will need to
> > >> >> allow per column compression:
> > >> >> https://github.com/apache/parquet-java/pull/3396.
> > >> >>
> > >> >> Thanks again for your help.
> > >> >>
> > >> >> Regards,
> > >> >> Rahil Chertara
> > >> >>
> > >> >
> > >>
> > >
> >
>
