Raising this thread again to see if others in the community have thoughts on this topic and to discuss potential next steps.
I believe Rok had an original proposal here
https://github.com/apache/parquet-format/pull/241 for adding a
FIXED_SIZE_LIST logical type, which users could leverage in the future for
writing vector embeddings. The initial proposed spec makes sense to me, so I
am wondering whether the community needs help with any other items outside
the spec PR. If so, I do not mind assisting with items in parquet-java.

Regards,
Rahil Chertara

On Mon, Feb 23, 2026 at 5:56 PM Rahil C <[email protected]> wrote:

> Thanks Micah for the helpful pointers and the initial pass on the column
> compression PR <https://github.com/apache/parquet-java/pull/3396>; I
> appreciate it greatly. What you mentioned aligns with some micro-benchmarks
> I've been running that compare Parquet against the Lance file format for
> writing and reading vectors.
>
> The experiment involved writing 10,000 vectors (each with 1,536
> dimensions, where elements are 4-byte FLOATs, resulting in about 6 KB per
> record), using the respective file format's Java APIs:
> * We performed a full round trip: writing all vectors to the file and then
> reading them back.
> * For Parquet we tried several combinations of physical type backings
> (LIST<FLOAT> and FIXED_LEN_BYTE_ARRAY), the relevant encodings (PLAIN,
> BYTE_STREAM_SPLIT) mentioned in Prateek's ALP doc [1], and different
> compressions (SNAPPY, ZSTD, UNCOMPRESSED). We also disabled dictionary
> encoding and statistics on the vector embedding column. Finally, we tuned
> the row group size to match the file size (effectively one row group) and
> the page size to the size of one vector embedding, as mentioned in Julien's
> blog [2] and the blog you shared above from Xiangpeng [3].
> * For Lance we opted for vanilla settings, based on its claim of already
> handling vectors optimally. Under the hood, my understanding is that Lance
> uses Apache Arrow's FixedSizeList for vectors.
> * We performed 5 warmup rounds and 10 measurement rounds and collected the
> averages below.
> * The experiment was conducted on a local machine's file system as a quick
> test to get initial signals.
>
> An initial summary of the results:
> * Parquet LIST (BYTE_STREAM_SPLIT, ZSTD) produced the most compact file,
> but the difference compared to the other combinations was minimal.
> * Parquet FIXED outperformed Parquet LIST by a wide margin in all
> combinations.
> * Lance was the fastest overall on writes, though not by a large margin
> compared to Parquet FIXED.
> * Parquet FIXED (BYTE_STREAM_SPLIT, UNCOMPRESSED) was fastest on reads
> across all combinations.
>
> I have attached a gist here for others to view the full results:
> https://gist.github.com/rahil-c/066f689f91cdb91204a3fb4a9f2aefac
>
> Regarding the original FIXED_SIZE_LIST logical type PR
> <https://github.com/apache/parquet-format/pull/241/changes#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR292>,
> backing it with Parquet's primitive FIXED_LEN_BYTE_ARRAY makes sense to me.
> As long as you know the vector's dimension (D) and the element type (such
> as float32), you can allocate (D * 4) bytes.
> I am curious whether anyone in the community plans to revisit this, or if
> it is open for volunteers?
>
> Links:
> 1. https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0#heading=h.5xf60mx6q7xk
> 2. https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
> 3. https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion
>
> Regards,
> Rahil Chertara
>
> On Sat, Feb 21, 2026 at 11:05 AM Micah Kornfield <[email protected]>
> wrote:
>
>> FWIW, I ran across
>> https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/
>> which has different recommendations specifically for search, but it seems
>> to confirm some of your thoughts.
>>
>> Cheers,
>> Micah
>>
>> On Thursday, February 19, 2026, Micah Kornfield <[email protected]>
>> wrote:
>>
>> >> 1. Since Parquet does not have a logical VECTOR type, what data type
>> >> does the community recommend for writing vectors? My assumption is that
>> >> most users today would try Parquet's LIST with FLOAT, but are there
>> >> other ways to represent this better? Additionally, are there plans to
>> >> add a VECTOR type to Parquet in the future?
>> >
>> > If the lists are fixed size, and you have the metadata stored someplace
>> > externally, then using just FLOAT would be better (there is also the
>> > logical type float16, which could be useful). There is a stale proposal
>> > to support FIXED_SIZE_LIST which, if someone has bandwidth, we should
>> > maybe revisit [1].
>> >
>> >> 2. Since vectors have high cardinality, encodings such as DICTIONARY or
>> >> RLE might not be as useful. Is there a recommended encoding for users
>> >> to leverage today?
>> >
>> > I've heard anecdotally that BYTE_STREAM_SPLIT + compression can work
>> > pretty well with embedding data, but I don't have first-hand experience.
>> > I would expect DICTIONARY/RLE to fall back pretty quickly to PLAIN for
>> > this type of data. ALP, I think, also has a proposed encoding for
>> > handling more scientific-like data. I think Prateek might be considering
>> > adding it as a follow-up (it's at least been mentioned).
>> >
>> >> 3. Is there a recommendation for tuning row group and page size for
>> >> vectors? For example, is it always safe to set the row group size to
>> >> one per file and the page size to the size of one vector embedding
>> >> record?
>> >
>> > I don't have anything concrete here, but 1 page per vector feels small
>> > to me. I'd imagine you would at least want to pack O(100 KB) if not more
>> > into a page.
>> >
>> >> 4. In general, should users disable stats on these vector columns?
>> >
>> > Yes, I don't think stats are particularly useful here.
>> >
>> >> 5. Is there a recommended compression codec for vectors, or should they
>> >> generally be kept uncompressed? If vector embeddings should be kept
>> >> uncompressed, then for parquet-java I believe we will need to allow
>> >> per-column compression: https://github.com/apache/parquet-java/pull/3396.
>> >
>> > As mentioned above, I'd first try BYTE_STREAM_SPLIT + compression. I
>> > think being able to turn compression on/off per column is likely useful
>> > anyway, given the other lightweight encodings we've been exploring.
>> > Thanks for the contribution. I will try to do a first-pass review, but
>> > it would be great if someone more familiar with the Java implementation
>> > could help.
>> >
>> > Cheers,
>> > Micah
>> >
>> > [1] https://github.com/apache/parquet-format/pull/241
>> >
>> > On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> wrote:
>> >
>> >> Hi Parquet community, hope all is well.
>> >>
>> >> My name is Rahil Chertara, and I am an engineer working on open table
>> >> formats. I wanted to ask the community how to better configure Parquet
>> >> today for vector storage and retrieval.
>> >>
>> >> Typically, from what I've seen, most major models generate vector
>> >> embeddings as an array of floating point values, with dimensions around
>> >> 700-1500 elements (taking up about 3 KB-6 KB per vector):
>> >> https://developers.openai.com/api/docs/guides/embeddings
>> >> https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
>> >>
>> >> So my questions will be based on the above input.
>> >>
>> >> 1. Since Parquet does not have a logical VECTOR type, what data type
>> >> does the community recommend for writing vectors? My assumption is that
>> >> most users today would try Parquet's LIST with FLOAT, but are there
>> >> other ways to represent this better?
>> >> Additionally, are there plans to add a VECTOR type to Parquet in the
>> >> future?
>> >>
>> >> 2. Since vectors have high cardinality, encodings such as DICTIONARY or
>> >> RLE might not be as useful. Is there a recommended encoding for users
>> >> to leverage today?
>> >>
>> >> 3. Is there a recommendation for tuning row group and page size for
>> >> vectors? For example, is it always safe to set the row group size to
>> >> one per file and the page size to the size of one vector embedding
>> >> record?
>> >>
>> >> 4. In general, should users disable stats on these vector columns?
>> >>
>> >> 5. Is there a recommended compression codec for vectors, or should they
>> >> generally be kept uncompressed? If vector embeddings should be kept
>> >> uncompressed, then for parquet-java I believe we will need to allow
>> >> per-column compression: https://github.com/apache/parquet-java/pull/3396.
>> >>
>> >> Thanks again for your assistance and help.
>> >>
>> >> Regards,
>> >> Rahil Chertara
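The FIXED_LEN_BYTE_ARRAY argument in the thread (known dimension D, 4-byte
float32 elements, so exactly D * 4 bytes per value with no per-value lengths
or offsets) can be sketched in plain Python. The helper names below are
illustrative only, not parquet-java or parquet-format API:

```python
import struct

DIM = 1536  # dimensions per embedding, matching the benchmark above

def pack_vector(vector, dim=DIM):
    """Pack a float32 embedding into the fixed-size byte layout a
    FIXED_LEN_BYTE_ARRAY(dim * 4) column value would hold."""
    if len(vector) != dim:
        raise ValueError(f"expected {dim} elements, got {len(vector)}")
    return struct.pack(f"<{dim}f", *vector)  # little-endian float32

def unpack_vector(buf, dim=DIM):
    """Recover the embedding; the fixed width means no lengths or
    offsets need to be stored alongside the data."""
    return list(struct.unpack(f"<{dim}f", buf))

vec = [0.5] * DIM                 # 0.5 is exactly representable in float32
buf = pack_vector(vec)
assert len(buf) == DIM * 4        # 6,144 bytes, the ~6 KB per record above
assert unpack_vector(buf) == vec
```

This is the same reasoning behind backing a FIXED_SIZE_LIST logical type
with FIXED_LEN_BYTE_ARRAY: the reader needs only the dimension and element
type from metadata to slice values back out.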
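BYTE_STREAM_SPLIT, recommended in the thread, does not shrink data by
itself; it regroups bytes of the same significance so that a downstream
codec (SNAPPY, ZSTD) sees long, near-constant runs from the float32
sign/exponent bytes. A minimal sketch of the transform, not the parquet-java
implementation:

```python
import struct

def byte_stream_split(data: bytes, width: int = 4) -> bytes:
    """Regroup [a0 a1 a2 a3  b0 b1 b2 b3 ...] into
    [a0 b0 ...  a1 b1 ...  a2 b2 ...  a3 b3 ...]:
    one contiguous stream per byte position."""
    n = len(data) // width
    assert len(data) == n * width
    return bytes(data[i * width + b] for b in range(width) for i in range(n))

def byte_stream_join(data: bytes, width: int = 4) -> bytes:
    """Inverse transform, applied on the read path."""
    n = len(data) // width
    return bytes(data[b * n + i] for i in range(n) for b in range(width))

# Round trip over a buffer of slowly varying float32 values, the kind of
# data where the high-order byte streams become highly compressible:
floats = [1.0 + i * 1e-6 for i in range(1024)]
raw = struct.pack(f"<{len(floats)}f", *floats)
assert byte_stream_join(byte_stream_split(raw)) == raw
```

This also illustrates why the encoding pairs naturally with per-column
compression control: without a codec after the split, the transform alone
changes nothing about the size.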
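The page and row-group sizing discussion above reduces to simple
arithmetic. The 128 KB page target below is one reading of the O(100 KB)
suggestion in the thread, an assumption rather than a benchmarked
recommendation:

```python
DIM = 1536
ELEM_BYTES = 4                          # float32
vector_bytes = DIM * ELEM_BYTES         # 6,144 bytes, ~6 KB per record

# One page per vector (as in the benchmark) gives ~6 KB pages; packing
# O(100 KB) per page instead amortizes page header and decode overhead:
TARGET_PAGE_BYTES = 128 * 1024          # assumed target, ~O(100 KB)
vectors_per_page = TARGET_PAGE_BYTES // vector_bytes

NUM_VECTORS = 10_000
payload_bytes = NUM_VECTORS * vector_bytes  # raw vector data in the file

print(vector_bytes, vectors_per_page, payload_bytes)
```

With these numbers each page holds about 21 vectors, and the 10,000-vector
benchmark carries roughly 59 MiB of raw float data, which is why a single
row group per file was feasible in the experiment.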
