Hey Raphael,

Thanks for reaching out here. Have you looked into table formats such as Apache Iceberg <https://iceberg.apache.org/docs/nightly/>? This seems to address the problem you're describing.

A table format adds an ACID layer on top of the file format and acts as a fully functional database. In the case of Iceberg, a catalog is required for atomicity, and alternatives like Delta Lake also seem to be trending in that direction <https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.

> I'm conscious that for many users this responsibility is instead delegated
> to a catalog that maintains its own index structures and statistics, only
> relies on the parquet metadata for very late stage pruning, and may
> therefore see limited benefit from revisiting the parquet metadata
> structures.

This is exactly what Iceberg offers; it provides additional metadata to speed up the planning process: https://iceberg.apache.org/docs/nightly/performance/
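As a rough illustration of what that buys you, a filtered scan through PyIceberg might look like the sketch below. The catalog name, table identifier, and column names are made up for the example; the point is that the filter is evaluated against Iceberg's own metadata (manifests, partition values, column statistics) during planning, so non-matching data files are never opened.

    from pyiceberg.catalog import load_catalog

    # Hypothetical REST catalog and table, purely for illustration.
    catalog = load_catalog("default", uri="http://localhost:8181")
    table = catalog.load_table("examples.events")

    # The row filter is pruned against the table's manifest and partition
    # metadata before any parquet file is touched; parquet-level statistics
    # are only consulted for late-stage row group pruning.
    scan = table.scan(
        row_filter="event_date >= '2024-05-01'",
        selected_fields=("event_id", "event_date"),
    )
    result = scan.to_arrow()  # or scan.to_pandas()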
Kind regards,
Fokko

On Sat 18 May 2024 at 16:40, Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> Hi All,
>
> The recent discussions about metadata make me wonder where a storage
> format ends and a database begins, as people seem to have differing
> expectations of parquet here. In particular, one school of thought
> posits that parquet should suffice as a standalone technology, where
> users can write parquet files to a store and efficiently query them
> directly with no additional technologies. However, others instead view
> parquet as a storage format for use in conjunction with some sort of
> catalog / metastore. These two approaches naturally place very different
> demands on the parquet format. The former case incentivizes constructing
> extremely large parquet files, potentially on the order of TBs [1], such
> that the parquet metadata alone can efficiently be used to service a
> query without lots of random I/O to separate files. However, the latter
> case incentivizes relatively small parquet files (< 1GB) laid out in
> such a way that the catalog metadata can be used to efficiently identify
> a much smaller set of files for a given query, and write amplification
> can be avoided for inserts.
>
> Having only ever used parquet in the context of data lake style systems,
> the catalog approach comes more naturally to me and plays to parquet's
> current strengths; however, this does not seem to be a universally held
> expectation. I've frequently found people surprised when queries
> performed in the absence of a catalog are slow, or who wish to
> efficiently mutate or append to parquet files in place [2] [3] [4]. It
> is possibly anecdotal, but these expectations seem to be more common
> where people are coming from python-based tooling such as pandas, and
> might reflect weaker tooling support for catalog systems in this ecosystem.
>
> Regardless, this mismatch appears to be at the core of at least some of
> the discussions about metadata. I do not think it a controversial take
> that the current metadata structures are simply not set up for files on
> the order of >1TB, where the metadata balloons to 10s or 100s of MB and
> takes 10s of milliseconds just to parse. If this is in scope it would
> justify major changes to the parquet metadata; however, I'm conscious
> that for many users this responsibility is instead delegated to a
> catalog that maintains its own index structures and statistics, only
> relies on the parquet metadata for very late stage pruning, and may
> therefore see limited benefit from revisiting the parquet metadata
> structures.
>
> I'd be very interested to hear other people's thoughts on this.
>
> Kind Regards,
>
> Raphael
>
> [1]: https://github.com/apache/arrow-rs/issues/5770
> [2]: https://github.com/apache/datafusion/issues/9654
> [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> [4]: https://github.com/apache/arrow-rs/issues/557
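To put rough numbers on the footer-parsing cost mentioned in the quoted message, the metadata of a parquet file can be read and timed on its own, for instance with pyarrow. This is a minimal sketch, and the file path is just a placeholder:

    import time
    import pyarrow.parquet as pq

    path = "large_file.parquet"  # placeholder; any sizeable parquet file

    start = time.perf_counter()
    metadata = pq.read_metadata(path)  # reads and parses only the thrift footer
    elapsed_ms = (time.perf_counter() - start) * 1e3

    print(f"row groups: {metadata.num_row_groups}, columns: {metadata.num_columns}")
    print(f"serialized footer size: {metadata.serialized_size} bytes")
    print(f"footer parse time: {elapsed_ms:.1f} ms")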