Re: [DISCUSS] schema_index

Alkis Evlogimenos Wed, 12 Jun 2024 00:07:07 -0700

>
> By the way, when using Flatbuffers, I would suggest that you
> optionally call Flatbuffers verification when benchmarking the parsing
> routine.



Good point. I do this by default, so the numbers quoted above
include verification.
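
To make the projection quoted below concrete, here is a rough Python sketch of the arithmetic. It is an illustration, not a measurement. The four constants are the projected figures from the thread; reading (1) as the footer fetch and (2) as the parse is an interpretation of the numbered list, and the function name, the 1000-column count, and the zero-cost local fetch are hypothetical values chosen only for the example.

```python
# Projected per-footer costs quoted in the thread (future version).
# Mapping (1) -> fetch and (2) -> parse is an assumption about the list.
FETCH_S = 30e-3          # (1) fetch the footer from an object store
PARSE_S = 50e-6          # (2) parse the footer
POSTPROCESS_S = 100e-6   # (3) one pass over the metadata
LOOKUP_S = 100e-9        # (4) O(1) lookup, per column


def footer_budget(num_columns, fetch_s=FETCH_S):
    """Return total time and the share of it spent in postprocessing (3).

    Hypothetical helper, written only to illustrate the arithmetic.
    """
    total = fetch_s + PARSE_S + POSTPROCESS_S + num_columns * LOOKUP_S
    return {"total_s": total, "postprocess_share": POSTPROCESS_S / total}


# Remote case: the fetch dominates, so (3) is a small fraction of the total.
remote = footer_budget(num_columns=1000)

# Local case: fetch cost set to zero as an illustrative extreme.
local = footer_budget(num_columns=1000, fetch_s=0.0)

print(f"remote: {remote['postprocess_share']:.2%} of total in (3)")
print(f"local:  {local['postprocess_share']:.2%} of total in (3)")
```

Under these assumptions the postprocessing pass is well under 1% of the total when the footer comes from an object store, but a much larger share when the fetch is nearly free, which is the local-file case raised in the quoted reply.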

On Tue, Jun 11, 2024 at 5:44 PM Antoine Pitrou <anto...@python.org> wrote:

> On Wed, 5 Jun 2024 21:09:04 +0200
> Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.INVALID>
> wrote:
> >
> > In practice what we want is things to be performant. Sometimes O(1)
> > matters, sometimes not.
>
> +1, good point :-)
>
> > (3) doing a pass over the metadata to guarantee (4) is O(1) does not fail
> > the goal of being fast as long as the cost of doing (3) is a lot smaller
> > than (1) + (2). In a future version, we would shrink footers by 2x and
> > speed up parsing by 100x. Then the above would look like this:
> >
> > 1. 30ms
> > 2. 50us
> > 3. 100us
> > 4. 100ns/col
>
> By the way, when using Flatbuffers, I would suggest that you
> optionally call Flatbuffers verification when benchmarking the parsing
> routine. This is because, in many cases, it is important to ensure that
> untrusted files cannot wreak havoc (we do fuzz the Parquet C++ reader
> to look out for such issues).
>
> > It still doesn't matter if we do some lightweight postprocessing (3)
> > given that fetching is so slow.
>
> Yet, please be aware that not all fetching would happen on an object
> store. Processing Parquet files locally is quite common as well, and in
> this context fetching the footer can be extremely fast (Parquet is
> frequently used as an efficient exchange format for large tabular data
> -- for many people, it is a binary CSV on steroids).
>
> Regards
>
> Antoine.
>
>
>
