Hey Raphael,

Thanks for reaching out here. Have you looked into table formats such as Apache
Iceberg <https://iceberg.apache.org/docs/nightly/>? This seems to address the
problem that you're describing.

A table format adds an ACID layer on top of the file format and acts as a fully
functional database table. In the case of Iceberg, a catalog is required for
atomicity, and alternatives like Delta Lake also seem to be trending in that
direction
<https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
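
To make this concrete, here is a minimal PyIceberg sketch (the catalog name,
connection URI, table identifier, and column below are just placeholders)
showing that it is the catalog that makes the write atomic:

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    # Placeholder catalog; PyIceberg infers a REST catalog from the http URI.
    catalog = load_catalog("default", uri="http://localhost:8181")

    # Hypothetical existing table whose schema matches the data below.
    table = catalog.load_table("examples.events")

    # append() writes new Parquet data files and then commits a new snapshot
    # through the catalog, so readers see either all of the new rows or none.
    table.append(pa.table({"id": pa.array([1, 2, 3], type=pa.int64())}))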

> I'm conscious that for many users this responsibility is instead delegated
> to a catalog that maintains its own index structures and statistics, only
> relies on the parquet metadata for very late stage pruning, and may therefore
> see limited benefit from revisiting the parquet metadata structures.

This is exactly what Iceberg offers: it provides additional metadata to
speed up the planning process:
https://iceberg.apache.org/docs/nightly/performance/
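
As a hedged sketch of what that planning looks like with PyIceberg (reusing
the hypothetical catalog, table, and column from above), the filter is
evaluated against Iceberg's manifest and partition statistics before any
Parquet footer is opened:

    from pyiceberg.catalog import load_catalog

    table = load_catalog("default", uri="http://localhost:8181").load_table("examples.events")

    # plan_files() prunes data files using the manifests and their statistics.
    scan = table.scan(row_filter="id >= 2", selected_fields=("id",))
    for task in scan.plan_files():
        print(task.file.file_path)

    # Only the surviving files are then read; the parquet metadata is used for
    # the remaining late-stage (row group / page) pruning.
    result = scan.to_arrow()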

Kind regards,
Fokko

On Sat 18 May 2024 at 16:40, Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> Hi All,
>
> The recent discussions about metadata make me wonder where a storage
> format ends and a database begins, as people seem to have differing
> expectations of parquet here. In particular, one school of thought
> posits that parquet should suffice as a standalone technology, where
> users can write parquet files to a store and efficiently query them
> directly with no additional technologies. However, others instead view
> parquet as a storage format for use in conjunction with some sort of
> catalog / metastore. These two approaches naturally place very different
> demands on the parquet format. The former case incentivizes constructing
> extremely large parquet files, potentially on the order of TBs [1], such
> that the parquet metadata alone can efficiently be used to service a
> query without lots of random I/O to separate files. However, the latter
> case incentivizes relatively small parquet files (< 1GB) laid out in
> such a way that the catalog metadata can be used to efficiently identify
> a much smaller set of files for a given query, and write amplification
> can be avoided for inserts.
>
> Having only ever used parquet in the context of data lake style systems,
> the catalog approach comes more naturally to me and plays to parquet's
> current strengths, however, this does not seem to be a universally held
> expectation. I've frequently found people surprised when queries
> performed in the absence of a catalog are slow, or who wish to
> efficiently mutate or append to parquet files in place [2] [3] [4]. It
> is possibly anecdotal but these expectations seem to be more common
> where people are coming from python-based tooling such as pandas, and
> might reflect weaker tooling support for catalog systems in this ecosystem.
>
> Regardless, this mismatch appears to be at the core of at least some of
> the discussions about metadata. I do not think it a controversial take
> that the current metadata structures are simply not set up for files on
> the order of >1TB, where the metadata balloons to 10s or 100s of MB and
> takes 10s of milliseconds just to parse. If this is in scope it would
> justify major changes to the parquet metadata, however, I'm conscious
> that for many users this responsibility is instead delegated to a
> catalog that maintains its own index structures and statistics, only
> relies on the parquet metadata for very late stage pruning, and may
> therefore see limited benefit from revisiting the parquet metadata
> structures.
>
> I'd be very interested to hear other people's thoughts on this.
>
> Kind Regards,
>
> Raphael
>
> [1]: https://github.com/apache/arrow-rs/issues/5770
> [2]: https://github.com/apache/datafusion/issues/9654
> [3]:
> https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> [4]: https://github.com/apache/arrow-rs/issues/557
>
>
