On Wed, 5 Jun 2024 21:09:04 +0200, Alkis Evlogimenos
<alkis.evlogime...@databricks.com.INVALID> wrote:
> 
> In practice what we want is things to be performant. Sometimes O(1)
> matters, sometimes not.

+1, good point :-)

> (3) doing a pass over the metadata to guarantee (4) is O(1) does not fail
> the goal of being fast as long as the cost of doing (3) is a lot smaller
> than (1) + (2). In a future version, we would shrink footers by 2x and
> speed up parsing by 100x. Then the above would look like this:
> 
> 1. 30ms
> 2. 50us
> 3. 100us
> 4. 100ns/col

By the way, when using Flatbuffers, I would suggest that you also
benchmark the parsing routine with Flatbuffers verification enabled.
In many cases it is important to ensure that untrusted files cannot
wreak havoc (we do fuzz the Parquet C++ reader to look out for such
issues).
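
To illustrate the kind of measurement I mean, here is a minimal sketch
in Python. It does not use Flatbuffers at all: parse_unchecked, verify
and parse_verified are hypothetical stand-ins operating on a toy
length-prefixed buffer, so only the shape of the comparison carries
over, not the numbers.

```python
import struct
import timeit

def parse_unchecked(buf):
    # Hypothetical stand-in parser: trusts the 4-byte length prefix blindly.
    (n,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + n]

def verify(buf):
    # Hypothetical stand-in verifier: the declared length must fit the buffer.
    if len(buf) < 4:
        return False
    (n,) = struct.unpack_from("<I", buf, 0)
    return 4 + n <= len(buf)

def parse_verified(buf):
    # Verification runs first, so its cost is part of the measured time.
    if not verify(buf):
        raise ValueError("malformed footer")
    return parse_unchecked(buf)

payload = b"x" * 1000
good = struct.pack("<I", len(payload)) + payload
truncated = good[:100]  # the length prefix now overstates the payload

# Without verification, a truncated buffer silently yields short data.
assert len(parse_unchecked(truncated)) == 96

RUNS = 10_000
t_unchecked = timeit.timeit(lambda: parse_unchecked(good), number=RUNS)
t_verified = timeit.timeit(lambda: parse_verified(good), number=RUNS)
print(f"unchecked: {t_unchecked / RUNS * 1e9:.0f} ns/call")
print(f"verified:  {t_verified / RUNS * 1e9:.0f} ns/call")
```

Reporting both numbers side by side makes it clear how much of the
parsing budget goes to rejecting malformed input.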

> It still doesn't matter if we do some lightweight postprocessing (3) given
> that fetching is so slow.

That said, please be aware that not all fetching happens on an object
store. Processing Parquet files locally is quite common as well, and in
this context fetching the footer can be extremely fast (Parquet is
frequently used as an efficient exchange format for large tabular data
-- for many people, it is a binary CSV on steroids).
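
As a back-of-the-envelope check, the sketch below plugs the quoted
timings into a simple additive cost model and computes how much of the
total the postprocessing pass (3) represents. The 20us local fetch time
is purely an assumption for illustration (a footer read from a warm
page cache), not a measurement, and postprocess_share is a hypothetical
helper name.

```python
# All values in seconds, taken from the quoted list unless noted.
PARSE = 50e-6          # (2) parse the footer
POSTPROCESS = 100e-6   # (3) one pass over the metadata
PER_COLUMN = 100e-9    # (4) per-column lookup

def postprocess_share(fetch_s, columns=0):
    """Fraction of the total time spent in the postprocessing pass (3)."""
    total = fetch_s + PARSE + POSTPROCESS + columns * PER_COLUMN
    return POSTPROCESS / total

OBJECT_STORE_FETCH = 30e-3   # (1) as quoted
LOCAL_FETCH = 20e-6          # ASSUMPTION: illustrative local read, not measured

print(f"object store: {postprocess_share(OBJECT_STORE_FETCH):.1%}")
print(f"local file:   {postprocess_share(LOCAL_FETCH):.1%}")
```

With the object-store figure, the pass is well under 1% of the total;
with the assumed local figure, it becomes the dominant cost.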

Regards

Antoine.

