Thanks for working and reporting on this, Rahil! I've updated the FIXED_SIZE_LIST PR with the feedback received and reported back to the original ML thread [1]. It would be great to bring more scientific-data support into Parquet! More feedback from the community would be most welcome!
Best,
Rok

[1] https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3

On Mon, Mar 2, 2026 at 11:07 PM Rahil C <[email protected]> wrote:

> Raising this thread again to see if others in the community have
> thoughts on this topic and to discuss potential next steps.
>
> I believe Rok had an original proposal here
> https://github.com/apache/parquet-format/pull/241 for adding a
> FIXED_SIZE_LIST logical type, which users can leverage in the future
> for writing vector embeddings.
> The initial proposed spec makes sense to me, so I am wondering if the
> community needs help with any other items outside the spec PR? If so,
> I do not mind assisting with items in parquet-java.
>
> Regards,
> Rahil Chertara
>
> On Mon, Feb 23, 2026 at 5:56 PM Rahil C <[email protected]> wrote:
>
> > Thanks Micah for the helpful pointers and the initial pass on the
> > column compression PR
> > <https://github.com/apache/parquet-java/pull/3396>; I appreciate it
> > greatly. What you mentioned aligns with some micro-benchmarks I've
> > been running that compare Parquet against the Lance file format for
> > writing and reading vectors.
> >
> > The experiment involved writing 10,000 vectors (each with 1,536
> > dimensions, where elements are 4-byte FLOATs, resulting in about
> > 6 KB per record) using each file format's Java APIs:
> > * We performed a full round trip: writing all vectors to the file
> > and then reading them back.
> > * For Parquet we tried several combinations of physical type
> > backings (LIST<FLOAT> and FIXED_LEN_BYTE_ARRAY), the relevant
> > encodings (PLAIN, BYTE_STREAM_SPLIT) mentioned in Prateek's ALP doc
> > [1], and different compressions (SNAPPY, ZSTD, UNCOMPRESSED). We
> > also disabled dictionary encoding and disabled statistics on the
> > vector embedding column.
> > Finally, we also tuned the row group size to match the file size
> > (effectively one row group) and the page size to the size of one
> > vector embedding, as mentioned in Julien's blog [2] and the blog
> > you shared above from Xiangpeng [3].
> > * For Lance we opted to use vanilla settings, based on its claims
> > of already handling vectors optimally. Under the hood, my
> > understanding is that Lance uses Apache Arrow's FixedSizeList for
> > vectors.
> > * We performed 5 warmup rounds and 10 measurement rounds and
> > collected the averages below.
> > * The experiment was conducted on a local machine's file system as
> > a quick test to get initial signals.
> >
> > An initial summary of the results:
> > * Parquet LIST (BYTE_STREAM_SPLIT, ZSTD) had the most compact file
> > size, but the difference compared to other combinations was
> > minimal.
> > * Parquet FIXED performed better than Parquet LIST by a wide margin
> > in all combinations.
> > * Lance was the fastest overall on writes, but not by a large
> > margin compared to Parquet FIXED.
> > * Parquet FIXED (BYTE_STREAM_SPLIT, UNCOMPRESSED) was the fastest
> > on reads across all combinations.
> > I have attached a gist here for others to view the full results:
> > https://gist.github.com/rahil-c/066f689f91cdb91204a3fb4a9f2aefac
> >
> > Regarding the original FIXED_SIZE_LIST logical type PR
> > <https://github.com/apache/parquet-format/pull/241/changes#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR292>,
> > backing it with Parquet's primitive FIXED_LEN_BYTE_ARRAY makes
> > sense to me. As long as you know the vector's dimension (D) and the
> > element type (such as float32), you can allocate (D * 4) bytes.
> > I am curious if anyone in the community plans to revisit this, or
> > if this is open for volunteers?
> >
> > Links:
> > 1. https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0#heading=h.5xf60mx6q7xk
> > 2.
> > https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html
> > 3. https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion
> >
> > Regards,
> > Rahil Chertara
> >
> > On Sat, Feb 21, 2026 at 11:05 AM Micah Kornfield
> > <[email protected]> wrote:
> >
> > > FWIW, I ran across
> > > https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/
> > > which has different recommendations specifically for search, but
> > > it seems to confirm some of your thoughts.
> > >
> > > Cheers,
> > > Micah
> > >
> > > On Thursday, February 19, 2026, Micah Kornfield
> > > <[email protected]> wrote:
> > >
> > > > > 1. Since Parquet does not have a logical VECTOR type, what
> > > > > data type does the community recommend for writing vectors?
> > > > > My assumption is that most users today would try Parquet's
> > > > > LIST with FLOAT, but are there other ways to represent this
> > > > > better? Additionally, are there plans to add a VECTOR type to
> > > > > Parquet in the future?
> > > >
> > > > If the lists are fixed size, and you have the metadata stored
> > > > someplace externally, then using just FLOAT would be better
> > > > (there is also the logical type float16, which could be
> > > > useful). There is a stale proposal to support FIXED_SIZE_LIST
> > > > which, if someone has bandwidth, we should maybe revisit [1].
> > > >
> > > > > 2. Since vectors have high cardinality, encodings such as
> > > > > DICTIONARY or RLE might not be as useful. Is there a
> > > > > recommended encoding for users to leverage today?
> > > >
> > > > I've heard anecdotally that BYTE_STREAM_SPLIT + compression can
> > > > work pretty well with embedding data, but I don't have
> > > > first-hand experience. I would expect DICTIONARY/RLE would fall
> > > > back pretty quickly to plain for this type of data. ALP, I
> > > > think, also has a proposed encoding for handling more
> > > > scientific-like data.
> > > > I think Prateek might be considering adding it as a follow-up
> > > > (it's at least been mentioned).
> > > >
> > > > > 3. Is there a recommendation for tuning row group and page
> > > > > size for vectors? For example, is it always safe to set the
> > > > > row group size to one per file and the page size to the size
> > > > > of one vector embedding record?
> > > >
> > > > I don't have anything concrete here, but 1 page per vector
> > > > feels small to me. I'd imagine you would at least want to pack
> > > > O(100 KB), if not more, into a page.
> > > >
> > > > > 4. In general, should users disable stats on these vector
> > > > > columns?
> > > >
> > > > Yes, I don't think stats are particularly useful here.
> > > >
> > > > > 5. Is there a recommended compression codec for vectors, or
> > > > > should they generally be kept uncompressed? If vector
> > > > > embeddings should be kept uncompressed, then for parquet-java
> > > > > I believe we will need to allow per-column compression
> > > > > https://github.com/apache/parquet-java/pull/3396.
> > > >
> > > > As mentioned above, I'd first try BYTE_STREAM_SPLIT +
> > > > compression. I think being able to turn compression on/off per
> > > > column is likely useful anyway, given the other lightweight
> > > > encodings we've been exploring. Thanks for the contribution. I
> > > > will try to do a first-pass review, but it would be great if
> > > > someone more familiar with the Java implementation could help.
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/parquet-format/pull/241
> > > >
> > > > On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Parquet community, hope all is well.
> > > > >
> > > > > My name is Rahil Chertara, and I am an engineer working on
> > > > > open table formats. I wanted to ask the community how to
> > > > > better configure Parquet currently for vector storage and
> > > > > retrieval.
> > > > > Typically, from what I've seen, most major models generate
> > > > > vector embeddings as an array of floating-point values, with
> > > > > dimensions around 700–1500 elements (taking up about 3–6 KB
> > > > > per vector):
> > > > > https://developers.openai.com/api/docs/guides/embeddings
> > > > > https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings
> > > > >
> > > > > So my questions will be based on the above input.
> > > > >
> > > > > 1. Since Parquet does not have a logical VECTOR type, what
> > > > > data type does the community recommend for writing vectors?
> > > > > My assumption is that most users today would try Parquet's
> > > > > LIST with FLOAT, but are there other ways to represent this
> > > > > better? Additionally, are there plans to add a VECTOR type to
> > > > > Parquet in the future?
> > > > >
> > > > > 2. Since vectors have high cardinality, encodings such as
> > > > > DICTIONARY or RLE might not be as useful. Is there a
> > > > > recommended encoding for users to leverage today?
> > > > >
> > > > > 3. Is there a recommendation for tuning row group and page
> > > > > size for vectors? For example, is it always safe to set the
> > > > > row group size to one per file and the page size to the size
> > > > > of one vector embedding record?
> > > > >
> > > > > 4. In general, should users disable stats on these vector
> > > > > columns?
> > > > >
> > > > > 5. Is there a recommended compression codec for vectors, or
> > > > > should they generally be kept uncompressed? If vector
> > > > > embeddings should be kept uncompressed, then for parquet-java
> > > > > I believe we will need to allow per-column compression
> > > > > https://github.com/apache/parquet-java/pull/3396.
> > > > >
> > > > > Thanks again for your assistance and help.
> > > > >
> > > > > Regards,
> > > > > Rahil Chertara
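A small aside on the (D * 4)-byte sizing discussed in the thread: packing a float32 vector into the fixed-size buffer a FIXED_LEN_BYTE_ARRAY column would carry is just a byte-layout exercise. The sketch below uses only JDK classes (the class and method names are my own, not parquet-java APIs) and assumes little-endian element order, which is what Parquet's PLAIN float encoding uses:

```java
// Sketch, not a parquet-java API: pack a float32 vector of dimension D
// into the D * 4 bytes that a FIXED_LEN_BYTE_ARRAY cell would hold.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VectorBytes {
    // Size in bytes of a FIXED_LEN_BYTE_ARRAY backing a float32 vector.
    static int fixedLenBytes(int dimension) {
        return dimension * Float.BYTES; // D * 4
    }

    // Flatten the vector into a little-endian byte array.
    static byte[] pack(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(fixedLenBytes(vector.length))
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // Recover the vector from the fixed-size byte array.
    static float[] unpack(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat();
        }
        return out;
    }

    public static void main(String[] args) {
        float[] vec = new float[1536]; // a 1,536-dimension embedding
        System.out.println(pack(vec).length); // prints 6144
    }
}
```

For a 1,536-dimension embedding this gives 6,144 bytes, consistent with the roughly 6 KB per record figure from the benchmark above.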

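As an illustration of the BYTE_STREAM_SPLIT encoding recommended in the thread: for N float32 values, byte j of every value is grouped into stream j and the four streams are concatenated, so the similar sign/exponent bytes sit next to each other and tend to compress better afterwards. The sketch below is a minimal plain-JDK illustration of that rearrangement, not the parquet-java encoder:

```java
// Sketch of the BYTE_STREAM_SPLIT byte rearrangement for float32 data.
// Not the parquet-java implementation; for illustration only.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class StreamSplit {
    static byte[] split(float[] values) {
        final int k = Float.BYTES; // 4 streams for float32
        // First lay the values out in plain little-endian form.
        byte[] plain = new byte[values.length * k];
        ByteBuffer.wrap(plain).order(ByteOrder.LITTLE_ENDIAN)
                  .asFloatBuffer().put(values);
        // Then scatter byte j of value i into stream j.
        byte[] out = new byte[plain.length];
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < k; j++) {
                // stream j occupies out[j * N .. j * N + N - 1]
                out[j * values.length + i] = plain[i * k + j];
            }
        }
        return out;
    }
}
```

Note the encoding itself does not shrink the data; it only reorders bytes so that a general-purpose codec (e.g. ZSTD) applied afterwards finds longer runs of similar bytes, which matches the "BYTE_STREAM_SPLIT + compression" pairing Micah suggests.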