Raising this thread again to see if others in the community have thoughts on this topic and to discuss potential next steps.
I believe Rok had an original proposal here
https://github.com/apache/parquet-format/pull/241 for adding a
FIXED_SIZE_LIST logical type, which users could leverage in the future for
writing vector embeddings. The initial proposed spec makes sense to me, so I
am wondering whether the community needs help with any other items outside
the spec PR. If so, I do not mind assisting with items in parquet-java.

Regards,
Rahil Chertara

On Mon, Feb 23, 2026 at 5:56 PM Rahil C <[email protected]> wrote:

> Thanks Micah for the helpful pointers and the initial pass on the column
> compression PR <https://github.com/apache/parquet-java/pull/3396>; I
> appreciate it greatly. What you mentioned aligns with some micro-benchmarks
> I've been running that compare Parquet against the Lance file format for
> writing and reading vectors.
>
> The experiment involved writing 10,000 vectors (each with 1,536
> dimensions, where elements are 4-byte FLOATs, resulting in about 6 KB per
> record), using the respective file format's Java APIs:
> * We performed a full round trip: writing all vectors to the file and then
> reading them back.
> * For Parquet we tried several combinations of physical type backings
> (LIST<FLOAT> and FIXED_LEN_BYTE_ARRAY), the relevant encodings (PLAIN,
> BYTE_STREAM_SPLIT) mentioned in Prateek's ALP doc [1], and different
> compressions (SNAPPY, ZSTD, UNCOMPRESSED). We also disabled dictionary
> encoding and statistics on the vector embedding column. Finally, we tuned
> the row group size to match the file size (effectively one row group) and
> the page size to the size of one vector embedding, as mentioned in Julien's
> blog [2] and the blog you shared above from Xiangpeng [3].
> * For Lance we opted for vanilla settings, based on its claim of already
> handling vectors optimally. Under the hood, my understanding is that Lance
> uses Apache Arrow's FixedSizeList for vectors.
> * We performed 5 warmup rounds and 10 measurement rounds and collected the
> averages below.
> * The experiment was conducted on a local machine's file system as a quick
> test to get initial signals.
>
> An initial summary of the results:
> * Parquet LIST (BYTE_STREAM_SPLIT, ZSTD) produced the most compact file,
> but the difference compared to the other combinations was minimal.
> * Parquet FIXED outperformed Parquet LIST by a wide margin in all
> combinations.
> * Lance was the fastest overall on writes, though not by a large margin
> compared to Parquet FIXED.
> * Parquet FIXED (BYTE_STREAM_SPLIT, UNCOMPRESSED) was fastest on reads
> across all combinations.
>
> I have attached a gist here for others to view the full results:
> https://gist.github.com/rahil-c/066f689f91cdb91204a3fb4a9f2aefac
>
> Regarding the original FIXED_SIZE_LIST logical type PR
> <https://github.com/apache/parquet-format/pull/241/changes#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR292>,
> backing it with Parquet's primitive FIXED_LEN_BYTE_ARRAY makes sense to me.
> As long as you know the vector's dimension (D) and the element type (such
> as float32), you can allocate (D * 4) bytes.
> I am curious whether anyone in the community plans to revisit this, or if
> it is open for volunteers?
>
> Links:
> 1. https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0#heading=h.5xf60mx6q7xk
> 2. https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
> 3. https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion
>
> Regards,
> Rahil Chertara
>
> On Sat, Feb 21, 2026 at 11:05 AM Micah Kornfield <[email protected]>
> wrote:
>
>> FWIW, I ran across
>> https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/
>> which has different recommendations specifically for search, but it seems
>> to confirm some of your thoughts.
>>
>> Cheers,
>> Micah
>>
>> On Thursday, February 19, 2026, Micah Kornfield <[email protected]>
>> wrote:
>>
>> >> 1. Since Parquet does not have a logical VECTOR type, what data type
>> >> does the community recommend for writing vectors? My assumption is that
>> >> most users today would try Parquet's LIST with FLOAT, but are there
>> >> other ways to represent this better? Additionally, are there plans to
>> >> add a VECTOR type to Parquet in the future?
>> >
>> > If the lists are fixed size, and you have the metadata stored someplace
>> > externally, then using just FLOAT would be better (there is also the
>> > logical type float16, which could be useful). There is a stale proposal
>> > to support FIXED_SIZE_LIST which, if someone has bandwidth, we should
>> > maybe revisit [1].
>> >
>> >> 2. Since vectors have high cardinality, encodings such as DICTIONARY or
>> >> RLE might not be as useful. Is there a recommended encoding for users
>> >> to leverage today?
>> >
>> > I've heard anecdotally that BYTE_STREAM_SPLIT + compression can work
>> > pretty well with embedding data, but I don't have first-hand experience.
>> > I would expect DICTIONARY/RLE to fall back pretty quickly to PLAIN for
>> > this type of data. ALP, I think, also has a proposed encoding for
>> > handling more scientific-like data. I think Prateek might be considering
>> > adding it as a follow-up (it's at least been mentioned).
>> >
>> >> 3. Is there a recommendation for tuning row group and page size for
>> >> vectors? For example, is it always safe to set the row group size to
>> >> one per file and the page size to the size of one vector embedding
>> >> record?
>> >
>> > I don't have anything concrete here, but 1 page per vector feels small
>> > to me. I'd imagine you would at least want to pack O(100 KB) if not more
>> > into a page.
>> >
>> >> 4. In general, should users disable stats on these vector columns?
>> >
>> > Yes, I don't think stats are particularly useful here.
>> >
>> >> 5. Is there a recommended compression codec for vectors, or should they
>> >> generally be kept uncompressed? If vector embeddings should be kept
>> >> uncompressed, then for parquet-java I believe we will need to allow
>> >> per-column compression: https://github.com/apache/parquet-java/pull/3396.
>> >
>> > As mentioned above, I'd first try BYTE_STREAM_SPLIT + compression. I
>> > think being able to turn compression on/off per column is likely useful
>> > anyway, given the other lightweight encodings we've been exploring.
>> > Thanks for the contribution. I will try to do a first-pass review, but
>> > it would be great if someone more familiar with the Java implementation
>> > could help.
>> >
>> > Cheers,
>> > Micah
>> >
>> > [1] https://github.com/apache/parquet-format/pull/241
>> >
>> > On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> wrote:
>> >
>> >> Hi Parquet community, hope all is well.
>> >>
>> >> My name is Rahil Chertara, and I am an engineer working on open table
>> >> formats. I wanted to ask the community how to better configure Parquet
>> >> today for vector storage and retrieval.
>> >>
>> >> Typically, from what I've seen, most major models generate vector
>> >> embeddings as an array of floating point values, with dimensions around
>> >> 700-1500 elements (taking up about 3 KB-6 KB per vector):
>> >> https://developers.openai.com/api/docs/guides/embeddings
>> >> https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
>> >>
>> >> So my questions will be based on the above input.
>> >>
>> >> 1. Since Parquet does not have a logical VECTOR type, what data type
>> >> does the community recommend for writing vectors? My assumption is that
>> >> most users today would try Parquet's LIST with FLOAT, but are there
>> >> other ways to represent this better?
>> >> Additionally, are there plans to add a VECTOR type to Parquet in the
>> >> future?
>> >>
>> >> 2. Since vectors have high cardinality, encodings such as DICTIONARY or
>> >> RLE might not be as useful. Is there a recommended encoding for users
>> >> to leverage today?
>> >>
>> >> 3. Is there a recommendation for tuning row group and page size for
>> >> vectors? For example, is it always safe to set the row group size to
>> >> one per file and the page size to the size of one vector embedding
>> >> record?
>> >>
>> >> 4. In general, should users disable stats on these vector columns?
>> >>
>> >> 5. Is there a recommended compression codec for vectors, or should they
>> >> generally be kept uncompressed? If vector embeddings should be kept
>> >> uncompressed, then for parquet-java I believe we will need to allow
>> >> per-column compression: https://github.com/apache/parquet-java/pull/3396.
>> >>
>> >> Thanks again for your assistance and help.
>> >>
>> >> Regards,
>> >> Rahil Chertara
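The FIXED_LEN_BYTE_ARRAY argument in the thread (known dimension D, 4-byte
float32 elements, so exactly D * 4 bytes per value with no per-value lengths
or offsets) can be sketched in plain Python. The helper names below are
illustrative only, not parquet-java or parquet-format API:

```python
import struct

DIM = 1536  # dimensions per embedding, matching the benchmark above

def pack_vector(vector, dim=DIM):
    """Pack a float32 embedding into the fixed-size byte layout a
    FIXED_LEN_BYTE_ARRAY(dim * 4) column value would hold."""
    if len(vector) != dim:
        raise ValueError(f"expected {dim} elements, got {len(vector)}")
    return struct.pack(f"<{dim}f", *vector)  # little-endian float32

def unpack_vector(buf, dim=DIM):
    """Recover the embedding; the fixed width means no lengths or
    offsets need to be stored alongside the data."""
    return list(struct.unpack(f"<{dim}f", buf))

vec = [0.5] * DIM                 # 0.5 is exactly representable in float32
buf = pack_vector(vec)
assert len(buf) == DIM * 4        # 6,144 bytes, the ~6 KB per record above
assert unpack_vector(buf) == vec
```

This is the same reasoning behind backing a FIXED_SIZE_LIST logical type
with FIXED_LEN_BYTE_ARRAY: the reader needs only the dimension and element
type from metadata to slice values back out.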
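BYTE_STREAM_SPLIT, recommended in the thread, does not shrink data by
itself; it regroups bytes of the same significance so that a downstream
codec (SNAPPY, ZSTD) sees long, near-constant runs from the float32
sign/exponent bytes. A minimal sketch of the transform, not the parquet-java
implementation:

```python
import struct

def byte_stream_split(data: bytes, width: int = 4) -> bytes:
    """Regroup [a0 a1 a2 a3  b0 b1 b2 b3 ...] into
    [a0 b0 ...  a1 b1 ...  a2 b2 ...  a3 b3 ...]:
    one contiguous stream per byte position."""
    n = len(data) // width
    assert len(data) == n * width
    return bytes(data[i * width + b] for b in range(width) for i in range(n))

def byte_stream_join(data: bytes, width: int = 4) -> bytes:
    """Inverse transform, applied on the read path."""
    n = len(data) // width
    return bytes(data[b * n + i] for i in range(n) for b in range(width))

# Round trip over a buffer of slowly varying float32 values, the kind of
# data where the high-order byte streams become highly compressible:
floats = [1.0 + i * 1e-6 for i in range(1024)]
raw = struct.pack(f"<{len(floats)}f", *floats)
assert byte_stream_join(byte_stream_split(raw)) == raw
```

This also illustrates why the encoding pairs naturally with per-column
compression control: without a codec after the split, the transform alone
changes nothing about the size.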
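The page and row-group sizing discussion above reduces to simple
arithmetic. The 128 KB page target below is one reading of the O(100 KB)
suggestion in the thread, an assumption rather than a benchmarked
recommendation:

```python
DIM = 1536
ELEM_BYTES = 4                          # float32
vector_bytes = DIM * ELEM_BYTES         # 6,144 bytes, ~6 KB per record

# One page per vector (as in the benchmark) gives ~6 KB pages; packing
# O(100 KB) per page instead amortizes page header and decode overhead:
TARGET_PAGE_BYTES = 128 * 1024          # assumed target, ~O(100 KB)
vectors_per_page = TARGET_PAGE_BYTES // vector_bytes

NUM_VECTORS = 10_000
payload_bytes = NUM_VECTORS * vector_bytes  # raw vector data in the file

print(vector_bytes, vectors_per_page, payload_bytes)
```

With these numbers each page holds about 21 vectors, and the 10,000-vector
benchmark carries roughly 59 MiB of raw float data, which is why a single
row group per file was feasible in the experiment.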
