> By the way, when using Flatbuffers, I would suggest that you
> optionally call Flatbuffers verification when benchmarking the parsing
> routine.

Good point, I do this by default. The numbers quoted above include verification.

On Tue, Jun 11, 2024 at 5:44 PM Antoine Pitrou <anto...@python.org> wrote:

> On Wed, 5 Jun 2024 21:09:04 +0200
> Alkis Evlogimenos <alkis.evlogime...@databricks.com.INVALID> wrote:
> >
> > In practice what we want is things to be performant. Sometimes O(1)
> > matters, sometimes not.
>
> +1, good point :-)
>
> > (3) doing a pass over the metadata to guarantee (4) is O(1) does not fail
> > the goal of being fast as long as the cost of doing (3) is a lot smaller
> > than (1) + (2). In a future version, we would shrink footers by 2x and
> > speed up parsing by 100x. Then the above would look like this:
> >
> > 1. 30ms
> > 2. 50us
> > 3. 100us
> > 4. 100ns/col
>
> By the way, when using Flatbuffers, I would suggest that you
> optionally call Flatbuffers verification when benchmarking the parsing
> routine. This is because, in many cases, it is important to ensure that
> untrusted files cannot wreak havoc (we do fuzz the Parquet C++ reader
> to look out for such issues).
>
> > It still doesn't matter if we do some lightweight postprocessing (3) given
> > that fetching is so slow.
>
> Yet, please be aware that not all fetching would happen on an object
> store. Processing Parquet files locally is quite common as well, and in
> this context fetching the footer can be extremely fast (Parquet is
> frequently used as an efficient exchange format for large tabular data
> -- for many people, it is a binary CSV on steroids).
>
> Regards
>
> Antoine.
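
To make the trade-off in the quoted exchange concrete, here is a rough Python sketch of the arithmetic. It assumes that items 1 to 4 stand for fetch, parse, postprocess, and per-column access, which is my reading of the thread rather than something it states outright. The function name `postprocess_share` and the local fetch time are hypothetical, chosen only for illustration; the local figure is not a measurement.

```python
def postprocess_share(fetch_s, parse_s, postprocess_s):
    """Fraction of footer-open time spent in the postprocessing pass (item 3)."""
    total = fetch_s + parse_s + postprocess_s
    return postprocess_s / total


# Figures quoted in the thread for the hypothetical future footer format.
FETCH_S = 30e-3        # item 1: 30ms
PARSE_S = 50e-6        # item 2: 50us
POSTPROCESS_S = 100e-6  # item 3: 100us

object_store_share = postprocess_share(FETCH_S, PARSE_S, POSTPROCESS_S)

# ASSUMPTION: an illustrative local fetch time, not a number from the thread.
LOCAL_FETCH_S = 100e-6
local_share = postprocess_share(LOCAL_FETCH_S, PARSE_S, POSTPROCESS_S)

print(f"object store fetch: postprocessing is {object_store_share:.2%} of the total")
print(f"assumed local fetch: postprocessing is {local_share:.2%} of the total")
```

With the quoted object-store numbers the postprocessing pass is a small fraction of the total, which is Alkis's point. Once the fetch time drops to the same order as the parse and postprocess steps, the pass stops being negligible, which is the caveat Antoine raises for local files.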