On Wed, 5 Jun 2024 21:09:04 +0200
Alkis Evlogimenos <alkis.evlogime...@databricks.com.INVALID> wrote:
>
> In practice what we want is things to be performant. Sometimes O(1)
> matters, sometimes not.

+1, good point :-)

> (3) doing a pass over the metadata to guarantee (4) is O(1) does not fail
> the goal of being fast as long as the cost of doing (3) is a lot smaller
> than (1) + (2). In a future version, we would shrink footers by 2x and
> speed up parsing by 100x. Then the above would look like this:
>
> 1. 30ms
> 2. 50us
> 3. 100us
> 4. 100ns/col

By the way, when using Flatbuffers, I would suggest that you optionally
call Flatbuffers verification when benchmarking the parsing routine.
This is because, in many cases, it is important to ensure that untrusted
files cannot wreak havoc (we do fuzz the Parquet C++ reader to look out
for such issues).

> It still doesn't matter if we do some lightweight postprocessing (3) given
> that fetching is so slow.

Yet, please be aware that not all fetching would happen on an object
store. Processing Parquet files locally is quite common as well, and in
this context fetching the footer can be extremely fast (Parquet is
frequently used as an efficient exchange format for large tabular data --
for many people, it is a binary CSV on steroids).

Regards

Antoine.
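
To make the local-versus-remote point concrete, here is a rough back-of-the-envelope sketch using the figures quoted above. It assumes that (1) is the footer fetch and (2) is parsing, which is how the quoted text reads. The 10us local fetch time and the 1000-column count are purely hypothetical values chosen for illustration, not measurements.

```python
# Figures quoted in the message (projected future version).
FETCH_REMOTE_S = 30e-3   # (1) assumed to be the footer fetch from an object store
PARSE_S = 50e-6          # (2) assumed to be footer parsing
POSTPROCESS_S = 100e-6   # (3) pass over the metadata
PER_COLUMN_S = 100e-9    # (4) per-column access cost

# Hypothetical values, not taken from the thread.
FETCH_LOCAL_S = 10e-6    # assumed fetch time for a locally cached file
NUM_COLUMNS = 1000       # assumed number of columns accessed


def postprocess_share(fetch_s: float, num_columns: int) -> float:
    """Return the fraction of total time spent in the postprocessing step (3)."""
    total = fetch_s + PARSE_S + POSTPROCESS_S + num_columns * PER_COLUMN_S
    return POSTPROCESS_S / total


remote_share = postprocess_share(FETCH_REMOTE_S, NUM_COLUMNS)
local_share = postprocess_share(FETCH_LOCAL_S, NUM_COLUMNS)

print(f"postprocessing share, remote fetch: {remote_share:.1%}")
print(f"postprocessing share, local fetch:  {local_share:.1%}")
```

Under these assumptions, step (3) is well under 1% of the total when the fetch takes 30ms, but more than a third of the total once the fetch is nearly free, so whether it "matters" depends on where the file lives.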