I agree that adding a FIXED_SIZE_LIST logical type to remove the overhead of storing the (identical) list length for every element sounds like a clear improvement, and it has a natural mapping to Arrow's FixedSizeList type.
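As a quick illustration of the layout this implies (a standard-library Python sketch with a toy dimension; this shows the byte layout only, not any Parquet API): a vector of D float32 elements can be stored as a single fixed-length value of D * 4 bytes, with no per-element length to repeat.

```python
import struct

# Toy dimension for illustration; real embeddings are ~700-1536 dims.
D = 4
vector = [0.5, -1.25, 3.0, 0.125]  # values exactly representable as float32

# Pack into one fixed-length byte string of D * 4 bytes -- the layout a
# FIXED_LEN_BYTE_ARRAY(D * 4) value backing a FIXED_SIZE_LIST would hold.
packed = struct.pack(f"<{D}f", *vector)
assert len(packed) == D * 4  # 16 bytes, no per-element length stored

# Decoding needs only the dimension D and the element type from the schema.
decoded = list(struct.unpack(f"<{D}f", packed))
assert decoded == vector
```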
On Tue, Mar 3, 2026 at 11:11 AM Rok Mihevc <[email protected]> wrote: > Thanks for working and reporting on this, Rahil! I've updated the > FIXED_SIZE_LIST PR with received feedback and reported to the original ML > thread [1]. It would be great to bring more scientific data support > into Parquet! > More feedback from the community would be most welcome! > > Best, > Rok > > [1] https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3 > > On Mon, Mar 2, 2026 at 11:07 PM Rahil C <[email protected]> wrote: > > > Raising this thread again to see if others in the community have thoughts > > on this topic and to discuss potential next steps. > > > > I believe Rok had an original proposal here > > https://github.com/apache/parquet-format/pull/241 for adding a > > FIXED_SIZE_LIST logical type, which users can leverage in the future for > > writing vector embeddings. > > The initial proposed spec makes sense to me so I am wondering if the > > community needs help with any other items outside the spec PR? If so, I > do > > not mind assisting with items in parquet-java. > > > > Regards, > > Rahil Chertara > > > > On Mon, Feb 23, 2026 at 5:56 PM Rahil C <[email protected]> wrote: > > > > > Thanks Micah for the helpful pointers and the initial pass on the > column > > > compression PR <https://github.com/apache/parquet-java/pull/3396>, I > > > appreciate it greatly. What you mentioned aligns with some > > micro-benchmarks > > > I've been running that compare Parquet against the Lance file format > for > > > writing and reading vectors. > > > > > > The experiment involved writing 10,000 vectors (each with 1,536 > > > dimensions, where elements are 4-byte FLOATs, resulting in about 6KB > per > > > record) and using the respective file format's Java APIs: > > > * We performed a full round trip: writing all vectors to the file and > > then > > > reading them back. 
> > > * For Parquet we tried several combinations of writing with different > > > physical type backings (LIST<FLOAT> and FIXED_LEN_BYTE_ARRAY), as well > as > > > relevant encodings (PLAIN, BYTE_STREAM_SPLIT) mentioned in Prateek's ALP > > doc > > > [1], and different compressions (SNAPPY, ZSTD, UNCOMPRESSED). We also > > > disabled dictionary encoding and disabled statistics on the vector > > > embedding column. Finally, we also tuned the row group size to match the > > > file size (effectively one row group) and the page size to the size of > > one > > > vector embedding, as mentioned in Julien's blog [2] and the blog you > > shared > > > above from Xiangpeng [3]. > > > * For Lance we opted to use vanilla settings based on its claims of > > > already handling vectors optimally. Under the hood my understanding is > > that > > > Lance uses Apache Arrow's FixedSizeList for vectors. > > > * We performed 5 warmup rounds and 10 measurement rounds and collected > > the > > > averages below. > > > * The experiment was conducted on a local machine's file system as a > > quick > > > test to get initial signals. > > > > > > An initial summary of the results: > > > * Parquet LIST (BYTE_STREAM_SPLIT, ZSTD) had the most compact file size > > > but the difference compared to other combinations was minimal. > > > * Parquet FIXED performed better than Parquet LIST by a wide margin in > > all > > > combinations. > > > * Lance was the fastest overall on writes but not by a large margin > > > compared to Parquet FIXED. > > > * Parquet FIXED (BYTE_STREAM_SPLIT, UNCOMPRESSED) was fastest on reads > > > across all combinations.
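For context, the sizes in the experiment above can be sanity-checked with a little arithmetic. The Python below only restates the experiment's stated parameters (10,000 vectors, 1,536 float32 dimensions); it does not reproduce any measured result:

```python
# Parameters stated in the experiment above.
num_vectors = 10_000
dims = 1_536
bytes_per_element = 4  # FLOAT (float32)

# Each record is dims * 4 bytes of vector data.
bytes_per_vector = dims * bytes_per_element
print(bytes_per_vector)          # 6144 bytes, i.e. ~6 KB per record

# Raw vector payload for the whole file, before encoding/compression.
raw_bytes = num_vectors * bytes_per_vector
print(raw_bytes / (1024 ** 2))   # 58.59375 MiB

# With one page per vector (as tuned above) the column has 10,000 pages;
# with ~100 KB pages (as suggested later in the thread) each page would
# instead hold on the order of:
print((100 * 1024) // bytes_per_vector)  # 16 vectors
```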
> > > I have attached a gist here for others to view the full results: > > > https://gist.github.com/rahil-c/066f689f91cdb91204a3fb4a9f2aefac > > > > > > Regarding the original FIXED_SIZE_LIST logical type PR > > > < > > > https://github.com/apache/parquet-format/pull/241/changes#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR292 > > >, > > > backing it with Parquet's primitive FIXED_LEN_BYTE_ARRAY makes sense to > > me. > > > As long as you know the vector's dimension (D) and the element type > (such > > > as float32), you can allocate (D * 4) bytes. > > > I am curious if anyone in the community plans to revisit this or if > this > > > is open for volunteers? > > > > > > Links: > > > 1. > > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0#heading=h.5xf60mx6q7xk > > > 2. > https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html > > > 3. > > > > > > https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion > > > > > > Regards, > > > Rahil Chertara > > > > > > On Sat, Feb 21, 2026 at 11:05 AM Micah Kornfield < > [email protected]> > > > wrote: > > > > > >> FWIW, I ran across > > >> > > >> > > > https://blog.xiangpeng.systems/posts/vector-search-with-parquet-datafusion/ > > >> which has different recommendations specifically for search but it > seems > > >> to > > >> confirm some of your thoughts. > > >> > > >> Cheers, > > >> Micah > > >> > > >> On Thursday, February 19, 2026, Micah Kornfield < > [email protected]> > > >> wrote: > > >> > > >> > 1. Since Parquet does not have a logical VECTOR type, what data type > > >> does > > >> >> the community recommend for writing vectors? My assumption is that > > most > > >> >> users today would try parquet's LIST with FLOAT but are there other > > >> ways > > >> >> to > > >> >> represent this better? Additionally, are there plans to add a > VECTOR > > >> type > > >> >> to Parquet in the future? 
> > >> > > > >> > > > >> > If the lists are fixed size, and you have the metadata stored > > someplace > > >> > externally, then using just FLOAT would be better (there is also the > > >> > logical type float16) which could be useful. There is a stale > > >> proposal to > > >> > support FIXED_SIZE_LIST which if someone has bandwidth we should maybe > > >> > revisit [1] > > >> > > > >> > 2. Since vectors have high cardinality, encodings such as DICTIONARY > > or > > >> RLE > > >> >> might not be as useful. Is there a recommended encoding for users > to > > >> >> leverage today? > > >> > > > >> > I've heard anecdotally that BYTE_STREAM_SPLIT + compression can work > > >> > pretty well with embedding data, but don't have first-hand > experience. > > >> I > > >> > would expect DICTIONARY/RLE would fall back pretty quickly to plain > > for > > >> > this type of data. ALP I think also has a proposed encoding for > > >> handling > > >> > more scientific-like data. I think Prateek might be considering > > adding > > >> as > > >> > a follow-up (it's at least been mentioned). > > >> > > > >> > 3. Is there a recommendation for tuning row group and page size for > > >> >> vectors? For example is it always safe to set the row group size to > > one > > >> >> per > > >> >> file and the page size to the size of one vector embedding record? > > >> > > > >> > > > >> > I don't have anything concrete here, but 1 page per vector feels small > > >> to > > >> > me. I'd imagine you would at least want to pack O(100 KB) if not more > > >> into > > >> > a page. > > >> > > > >> > 4. In general should users disable stats on these vector columns? > > >> > > > >> > > > >> > Yes, I don't think stats are particularly useful here. > > >> > > > >> > 5. Is there a recommended compression codec for vectors or should they > > >> >> generally be kept as uncompressed?
If vector embeddings should be > > kept > > >> >> uncompressed, then for parquet-java I believe we will need to allow > > per > > >> >> column compression > https://github.com/apache/parquet-java/pull/3396. > > >> > > > >> > > > >> > As mentioned above. I'd first try byte_stream_split + compression. > I > > >> > think being able to turn compression on/off per column is likely > > useful > > >> > anyways given the other light-weight encodings we've been exploring. > > >> > Thanks for the contribution. I will try to do a first pass review > but > > >> would > > >> > be great if someone more familiar with the java implementation could > > >> help. > > >> > > > >> > Cheers, > > >> > Micah > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > [1] https://github.com/apache/parquet-format/pull/241 > > >> > > > >> > On Thu, Feb 19, 2026 at 2:42 PM Rahil C <[email protected]> > wrote: > > >> > > > >> >> Hi Parquet community hope all is well, > > >> >> > > >> >> My name is Rahil Chertara, and I am an engineer working on open > table > > >> >> formats. I wanted to ask the community how to better configure > > Parquet > > >> >> currently for vector storage and retrieval. > > >> >> > > >> >> Typically from what I've seen, most major models generate vector > > >> >> embeddings > > >> >> as an array of floating point values, with dimensions around > 700-1500 > > >> >> elements (taking up about 3KB–6KB per vector) > > >> >> https://developers.openai.com/api/docs/guides/embeddings > > >> >> > > https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings > > >> >> > > >> >> So my questions will be based on the above input. > > >> >> > > >> >> 1. Since Parquet does not have a logical VECTOR type, what data > type > > >> does > > >> >> the community recommend for writing vectors? My assumption is that > > most > > >> >> users today would try parquet's LIST with FLOAT but are there other > > >> ways > > >> >> to > > >> >> represent this better? 
Additionally, are there plans to add a > VECTOR > > >> type > > >> >> to Parquet in the future? > > >> >> > > >> >> 2. Since vectors have high cardinality, encodings such as > DICTIONARY > > or > > >> >> RLE > > >> >> might not be as useful. Is there a recommended encoding for users > to > > >> >> leverage today? > > >> >> > > >> >> 3. Is there a recommendation for tuning row group and page size for > > >> >> vectors? For example is it always safe to set the row group size to > > one > > >> >> per > > >> >> file and the page size to the size of one vector embedding record? > > >> >> > > >> >> 4. In general should users disable stats on these vector columns? > > >> >> > > >> >> 5. Is there a recommended compression codec for vectors or should > > they > > >> >> generally be kept as uncompressed? If vector embeddings should be > > kept > > >> >> uncompressed, then for parquet-java I believe we will need to allow > > per > > >> >> column compression > https://github.com/apache/parquet-java/pull/3396. > > >> >> > > >> >> Thanks again for your assistance and help. > > >> >> > > >> >> Regards, > > >> >> Rahil Chertara > > >> >> > > >> > > > >> > > > > > >
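For readers unfamiliar with the BYTE_STREAM_SPLIT encoding discussed throughout this thread, the idea fits in a few lines. The Python below is a standard-library illustration of the transform, not the parquet-java implementation: byte k of every float32 value is gathered into its own stream, so the low-entropy sign/exponent bytes of similar floats end up adjacent, which is what lets a general-purpose codec like ZSTD compress the result better afterward.

```python
import struct

def byte_stream_split(values):
    """Encode float32 values: gather byte k of every element into stream k."""
    raw = struct.pack(f"<{len(values)}f", *values)
    # Concatenate the four streams: stream0 || stream1 || stream2 || stream3.
    return b"".join(raw[k::4] for k in range(4))

def byte_stream_unsplit(encoded, n):
    """Decode: re-interleave the four byte streams back into n float32 values."""
    streams = [encoded[k * n:(k + 1) * n] for k in range(4)]
    raw = bytes(b for elem in zip(*streams) for b in elem)
    return list(struct.unpack(f"<{n}f", raw))

# Round trip with values exactly representable as float32.
vecs = [0.5, 1.5, -2.25, 8.0, 0.125]
assert byte_stream_unsplit(byte_stream_split(vecs), len(vecs)) == vecs
```

Note that the transform alone does not shrink the data (the output is the same size as the input); the benefit appears only once a compressor runs over the rearranged bytes.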
