Thanks for working and reporting on this, Rahil! I've updated the FIXED_SIZE_LIST PR with the feedback received and reported back to the original ML thread [1]. It would be great to bring more scientific-data support into Parquet! More feedback from the community would be most welcome!
Best,
Rok

[1] https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3

On Mon, Mar 2, 2026 at 11:07 PM Rahil C <[email protected]> wrote:

> Raising this thread again to see if others in the community have
> thoughts on this topic and to discuss potential next steps.
>
> I believe Rok had an original proposal here
> https://github.com/apache/parquet-format/pull/241 for adding a
> FIXED_SIZE_LIST logical type, which users can leverage in the future
> for writing vector embeddings.
> The initial proposed spec makes sense to me, so I am wondering if the
> community needs help with any other items outside the spec PR? If so,
> I do not mind assisting with items in parquet-java.
>
> Regards,
> Rahil Chertara
>
> On Mon, Feb 23, 2026 at 5:56 PM Rahil C <[email protected]> wrote:
>
> > Thanks Micah for the helpful pointers and the initial pass on the
> > column compression PR
> > <https://github.com/apache/parquet-java/pull/3396>; I appreciate it
> > greatly. What you mentioned aligns with some micro-benchmarks I've
> > been running that compare Parquet against the Lance file format for
> > writing and reading vectors.
> >
> > The experiment involved writing 10,000 vectors (each with 1,536
> > dimensions, where elements are 4-byte FLOATs, resulting in about
> > 6 KB per record) using each file format's Java APIs:
> > * We performed a full round trip: writing all vectors to the file
> > and then reading them back.
> > * For Parquet we tried several combinations of physical type
> > backings (LIST<FLOAT> and FIXED_LEN_BYTE_ARRAY), the relevant
> > encodings (PLAIN, BYTE_STREAM_SPLIT) mentioned in Prateek's ALP doc
> > [1], and different compressions (SNAPPY, ZSTD, UNCOMPRESSED). We
> > also disabled dictionary encoding and disabled statistics on the
> > vector embedding column.
> > Finally, we also tuned the row group size to match the file size
> > (effectively one row group) and the page size to the size of one
> > vector embedding, as mentioned in Julien's blog [2] and the blog
> > you shared above from Xiangpeng [3].
> > * For Lance we opted to use vanilla settings, based on its claims
> > of already handling vectors optimally. Under the hood, my
> > understanding is that Lance uses Apache Arrow's FixedSizeList for
> > vectors.
> > * We performed 5 warmup rounds and 10 measurement rounds and
> > collected the averages below.
> > * The experiment was conducted on a local machine's file system as
> > a quick test to get initial signals.
> >
> > An initial summary of the results:
> > * Parquet LIST (BYTE_STREAM_SPLIT, ZSTD) had the most compact file
> > size, but the difference compared to other combinations was
> > minimal.
> > * Parquet FIXED performed better than Parquet LIST by a wide margin
> > in all combinations.
> > * Lance was the fastest overall on writes, but not by a large
> > margin compared to Parquet FIXED.
> > * Parquet FIXED (BYTE_STREAM_SPLIT, UNCOMPRESSED) was the fastest
> > on reads across all combinations.
> > I have attached a gist here for others to view the full results:
> > https://gist.github.com/rahil-c/066f689f91cdb91204a3fb4a9f2aefac
> >
> > Regarding the original FIXED_SIZE_LIST logical type PR
> > <https://github.com/apache/parquet-format/pull/241/changes#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR292>,
> > backing it with Parquet's primitive FIXED_LEN_BYTE_ARRAY makes
> > sense to me. As long as you know the vector's dimension (D) and the
> > element type (such as float32), you can allocate (D * 4) bytes.
> > I am curious if anyone in the community plans to revisit this, or
> > if this is open for volunteers?
> >
> > Links:
> > 1. https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0#heading=h.5xf60mx6q7xk
> > 2.
> > https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
> > 3. https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion
> >
> > Regards,
> > Rahil Chertara
> >
> > On Sat, Feb 21, 2026 at 11:05 AM Micah Kornfield
> > <[email protected]> wrote:
> >
> > > FWIW, I ran across
> > > https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/
> > > which has different recommendations specifically for search, but
> > > it seems to confirm some of your thoughts.
> > >
> > > Cheers,
> > > Micah
> > >
> > > On Thursday, February 19, 2026, Micah Kornfield
> > > <[email protected]> wrote:
> > >
> > > > > 1. Since Parquet does not have a logical VECTOR type, what
> > > > > data type does the community recommend for writing vectors?
> > > > > My assumption is that most users today would try Parquet's
> > > > > LIST with FLOAT, but are there other ways to represent this
> > > > > better? Additionally, are there plans to add a VECTOR type to
> > > > > Parquet in the future?
> > > >
> > > > If the lists are fixed size, and you have the metadata stored
> > > > someplace externally, then using just FLOAT would be better
> > > > (there is also the logical type float16, which could be
> > > > useful). There is a stale proposal to support FIXED_SIZE_LIST
> > > > which, if someone has bandwidth, we should maybe revisit [1].
> > > >
> > > > > 2. Since vectors have high cardinality, encodings such as
> > > > > DICTIONARY or RLE might not be as useful. Is there a
> > > > > recommended encoding for users to leverage today?
> > > >
> > > > I've heard anecdotally that BYTE_STREAM_SPLIT + compression can
> > > > work pretty well with embedding data, but I don't have
> > > > first-hand experience. I would expect DICTIONARY/RLE would fall
> > > > back pretty quickly to plain for this type of data. ALP, I
> > > > think, also has a proposed encoding for handling more
> > > > scientific-like data.
> > > > I think Prateek might be considering adding it as a follow-up
> > > > (it's at least been mentioned).
> > > >
> > > > > 3. Is there a recommendation for tuning row group and page
> > > > > size for vectors? For example, is it always safe to set the
> > > > > row group size to one per file and the page size to the size
> > > > > of one vector embedding record?
> > > >
> > > > I don't have anything concrete here, but 1 page per vector
> > > > feels small to me. I'd imagine you would at least want to pack
> > > > O(100 KB), if not more, into a page.
> > > >
> > > > > 4. In general, should users disable stats on these vector
> > > > > columns?
> > > >
> > > > Yes, I don't think stats are particularly useful here.
> > > >
> > > > > 5. Is there a recommended compression codec for vectors, or
> > > > > should they generally be kept uncompressed? If vector
> > > > > embeddings should be kept uncompressed, then for parquet-java
> > > > > I believe we will need to allow per-column compression
> > > > > https://github.com/apache/parquet-java/pull/3396.
> > > >
> > > > As mentioned above, I'd first try BYTE_STREAM_SPLIT +
> > > > compression. I think being able to turn compression on/off per
> > > > column is likely useful anyway, given the other lightweight
> > > > encodings we've been exploring. Thanks for the contribution. I
> > > > will try to do a first-pass review, but it would be great if
> > > > someone more familiar with the Java implementation could help.
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/parquet-format/pull/241
> > > >
> > > > On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Parquet community, hope all is well.
> > > > >
> > > > > My name is Rahil Chertara, and I am an engineer working on
> > > > > open table formats. I wanted to ask the community how to
> > > > > better configure Parquet currently for vector storage and
> > > > > retrieval.
> > > > > Typically, from what I've seen, most major models generate
> > > > > vector embeddings as an array of floating-point values, with
> > > > > dimensions around 700–1500 elements (taking up about 3–6 KB
> > > > > per vector):
> > > > > https://developers.openai.com/api/docs/guides/embeddings
> > > > > https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
> > > > >
> > > > > So my questions will be based on the above input.
> > > > >
> > > > > 1. Since Parquet does not have a logical VECTOR type, what
> > > > > data type does the community recommend for writing vectors?
> > > > > My assumption is that most users today would try Parquet's
> > > > > LIST with FLOAT, but are there other ways to represent this
> > > > > better? Additionally, are there plans to add a VECTOR type to
> > > > > Parquet in the future?
> > > > >
> > > > > 2. Since vectors have high cardinality, encodings such as
> > > > > DICTIONARY or RLE might not be as useful. Is there a
> > > > > recommended encoding for users to leverage today?
> > > > >
> > > > > 3. Is there a recommendation for tuning row group and page
> > > > > size for vectors? For example, is it always safe to set the
> > > > > row group size to one per file and the page size to the size
> > > > > of one vector embedding record?
> > > > >
> > > > > 4. In general, should users disable stats on these vector
> > > > > columns?
> > > > >
> > > > > 5. Is there a recommended compression codec for vectors, or
> > > > > should they generally be kept uncompressed? If vector
> > > > > embeddings should be kept uncompressed, then for parquet-java
> > > > > I believe we will need to allow per-column compression
> > > > > https://github.com/apache/parquet-java/pull/3396.
> > > > >
> > > > > Thanks again for your assistance and help.
> > > > >
> > > > > Regards,
> > > > > Rahil Chertara
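A small aside on the (D * 4)-byte sizing discussed in the thread: packing a float32 vector into the fixed-size buffer a FIXED_LEN_BYTE_ARRAY column would carry is just a byte-layout exercise. The sketch below uses only JDK classes (the class and method names are my own, not parquet-java APIs) and assumes little-endian element order, which is what Parquet's PLAIN float encoding uses:

```java
// Sketch, not a parquet-java API: pack a float32 vector of dimension D
// into the D * 4 bytes that a FIXED_LEN_BYTE_ARRAY cell would hold.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VectorBytes {
    // Size in bytes of a FIXED_LEN_BYTE_ARRAY backing a float32 vector.
    static int fixedLenBytes(int dimension) {
        return dimension * Float.BYTES; // D * 4
    }

    // Flatten the vector into a little-endian byte array.
    static byte[] pack(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(fixedLenBytes(vector.length))
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // Recover the vector from the fixed-size byte array.
    static float[] unpack(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat();
        }
        return out;
    }

    public static void main(String[] args) {
        float[] vec = new float[1536]; // a 1,536-dimension embedding
        System.out.println(pack(vec).length); // prints 6144
    }
}
```

For a 1,536-dimension embedding this gives 6,144 bytes, consistent with the roughly 6 KB per record figure from the benchmark above.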

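As an illustration of the BYTE_STREAM_SPLIT encoding recommended in the thread: for N float32 values, byte j of every value is grouped into stream j and the four streams are concatenated, so the similar sign/exponent bytes sit next to each other and tend to compress better afterwards. The sketch below is a minimal plain-JDK illustration of that rearrangement, not the parquet-java encoder:

```java
// Sketch of the BYTE_STREAM_SPLIT byte rearrangement for float32 data.
// Not the parquet-java implementation; for illustration only.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class StreamSplit {
    static byte[] split(float[] values) {
        final int k = Float.BYTES; // 4 streams for float32
        // First lay the values out in plain little-endian form.
        byte[] plain = new byte[values.length * k];
        ByteBuffer.wrap(plain).order(ByteOrder.LITTLE_ENDIAN)
                  .asFloatBuffer().put(values);
        // Then scatter byte j of value i into stream j.
        byte[] out = new byte[plain.length];
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < k; j++) {
                // stream j occupies out[j * N .. j * N + N - 1]
                out[j * values.length + i] = plain[i * k + j];
            }
        }
        return out;
    }
}
```

Note the encoding itself does not shrink the data; it only reorders bytes so that a general-purpose codec (e.g. ZSTD) applied afterwards finds longer runs of similar bytes, which matches the "BYTE_STREAM_SPLIT + compression" pairing Micah suggests.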