I would agree it's a bit of both. The metadata overhead (per unit of data volume) doesn't increase when you have fewer files. That being said, you could use fewer of the metadata features in that use case if the goal is to exchange well-formed data without ambiguity. For wide schemas it would be useful not to have to read metadata for columns you are not reading.
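To make that last point concrete, here is a rough illustrative sketch with pyarrow (the file name "wide.parquet" and the 2,000-column width are made up for the example): even when only a single column is projected, the reader still decodes the entire Thrift footer, including the per-column metadata for every column it will never touch.

import time

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical wide table: 2,000 columns, a handful of rows.
table = pa.table({f"col_{i}": [1, 2, 3] for i in range(2000)})
pq.write_table(table, "wide.parquet")

# The footer carries column-chunk metadata for every column, and the whole
# Thrift structure is decoded in one go.
start = time.perf_counter()
meta = pq.read_metadata("wide.parquet")
footer_ms = (time.perf_counter() - start) * 1000
print(f"columns={meta.num_columns} footer_bytes={meta.serialized_size} parse_ms={footer_ms:.1f}")

# Projecting a single column still pays the full footer-decoding cost above.
single = pq.read_table("wide.parquet", columns=["col_0"])
print(single.num_rows, single.num_columns)

On a genuinely wide file the footer size and parse time grow with the column count regardless of how few columns the query touches, which is the overhead being discussed.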
On Wed, May 22, 2024 at 9:26 AM Rok Mihevc <rok.mih...@gmail.com> wrote:

> I have worked in small data science/engineering teams where time to do
> engineering is often a luxury and ad hoc data transformations and analysis
> are the norm. In such environments a format that requires a catalog for
> efficient reads will be less effective than one that comes with batteries
> and good defaults included.
>
> Aside: a nice view into ad hoc parquet workloads in the wild is the Kaggle
> forums [1].
>
> [1] https://www.kaggle.com/search?q=parquet
>
> Rok
>
> On Wed, May 22, 2024 at 12:43 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > From my perspective I think the answer is more or less both. Even with
> > only the data lake use-case we see a wide variety of files on what people
> > would consider to be pushing reasonable boundaries. To some extent these
> > might be solvable by having libraries provide better defaults (e.g. only
> > collecting/writing statistics by default for the first N columns).
> >
> > On Tue, May 21, 2024 at 12:56 PM Steve Loughran
> > <ste...@cloudera.com.invalid> wrote:
> >
> > > I wish people would use avro over CSV. Not just for the schema or more
> > > complex structures, but because the parser recognises corrupt files.
> > > Oh, and the well-defined serialization formats for things like "string"
> > > and "number".
> > >
> > > That said, I generate CSV in test/utility code because it is trivial to
> > > do and then feed straight into a spreadsheet - I'm not trying to use it
> > > for interchange.
> > >
> > > On Sat, 18 May 2024 at 17:10, Curt Hagenlocher <c...@hagenlocher.org>
> > > wrote:
> > >
> > > > While CSV is still the undisputed monarch of exchanging data via
> > > > files, Parquet is arguably "top 3" -- and this is a scenario in which
> > > > the file really does need to be self-contained.
> > > >
> > > > On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies
> > > > <r.taylordav...@googlemail.com.invalid> wrote:
> > > >
> > > > > Hi Fokko,
> > > > >
> > > > > I am aware of catalogs such as Iceberg; my question was whether in
> > > > > the design of parquet we can assume the existence of such a catalog.
> > > > >
> > > > > Kind Regards,
> > > > >
> > > > > Raphael
> > > > >
> > > > > On 18 May 2024 16:18:22 BST, Fokko Driesprong <fo...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hey Raphael,
> > > > > >
> > > > > > Thanks for reaching out here. Have you looked into table formats
> > > > > > such as Apache Iceberg <https://iceberg.apache.org/docs/nightly/>?
> > > > > > This seems to fix the problem that you're describing.
> > > > > >
> > > > > > A table format adds an ACID layer to the file format and acts as a
> > > > > > fully functional database. In the case of Iceberg, a catalog is
> > > > > > required for atomicity, and alternatives like Delta Lake also seem
> > > > > > to trend in that direction
> > > > > > <https://github.com/orgs/delta-io/projects/10/views/1?pane=issue&itemId=57584023>.
> > > > > >
> > > > > > > I'm conscious that for many users this responsibility is instead
> > > > > > > delegated to a catalog that maintains its own index structures
> > > > > > > and statistics, only relies on the parquet metadata for very
> > > > > > > late stage pruning, and may therefore see limited benefit from
> > > > > > > revisiting the parquet metadata structures.
> > > > > >
> > > > > > This is exactly what Iceberg offers, it provides additional
> > > > > > metadata to speed up the planning process:
> > > > > > https://iceberg.apache.org/docs/nightly/performance/
> > > > > >
> > > > > > Kind regards,
> > > > > > Fokko
> > > > > >
> > > > > > On Sat, 18 May 2024 at 16:40, Raphael Taylor-Davies
> > > > > > <r.taylordav...@googlemail.com.invalid> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > The recent discussions about metadata make me wonder where a
> > > > > > > storage format ends and a database begins, as people seem to
> > > > > > > have differing expectations of parquet here. In particular, one
> > > > > > > school of thought posits that parquet should suffice as a
> > > > > > > standalone technology, where users can write parquet files to a
> > > > > > > store and efficiently query them directly with no additional
> > > > > > > technologies. However, others instead view parquet as a storage
> > > > > > > format for use in conjunction with some sort of catalog /
> > > > > > > metastore. These two approaches naturally place very different
> > > > > > > demands on the parquet format. The former case incentivizes
> > > > > > > constructing extremely large parquet files, potentially on the
> > > > > > > order of TBs [1], such that the parquet metadata alone can
> > > > > > > efficiently be used to service a query without lots of random
> > > > > > > I/O to separate files. However, the latter case incentivizes
> > > > > > > relatively small parquet files (< 1GB) laid out in such a way
> > > > > > > that the catalog metadata can be used to efficiently identify a
> > > > > > > much smaller set of files for a given query, and write
> > > > > > > amplification can be avoided for inserts.
> > > > > > >
> > > > > > > Having only ever used parquet in the context of data lake style
> > > > > > > systems, the catalog approach comes more naturally to me and
> > > > > > > plays to parquet's current strengths; however, this does not
> > > > > > > seem to be a universally held expectation. I've frequently found
> > > > > > > people surprised when queries performed in the absence of a
> > > > > > > catalog are slow, or who wish to efficiently mutate or append to
> > > > > > > parquet files in place [2] [3] [4]. It is possibly anecdotal but
> > > > > > > these expectations seem to be more common where people are
> > > > > > > coming from python-based tooling such as pandas, and might
> > > > > > > reflect weaker tooling support for catalog systems in this
> > > > > > > ecosystem.
> > > > > > >
> > > > > > > Regardless, this mismatch appears to be at the core of at least
> > > > > > > some of the discussions about metadata. I do not think it a
> > > > > > > controversial take that the current metadata structures are
> > > > > > > simply not set up for files on the order of >1TB, where the
> > > > > > > metadata balloons to 10s or 100s of MB and takes 10s of
> > > > > > > milliseconds just to parse. If this is in scope it would justify
> > > > > > > major changes to the parquet metadata; however, I'm conscious
> > > > > > > that for many users this responsibility is instead delegated to
> > > > > > > a catalog that maintains its own index structures and
> > > > > > > statistics, only relies on the parquet metadata for very late
> > > > > > > stage pruning, and may therefore see limited benefit from
> > > > > > > revisiting the parquet metadata structures.
> > > > > > >
> > > > > > > I'd be very interested to hear other people's thoughts on this.
> > > > > > >
> > > > > > > Kind Regards,
> > > > > > >
> > > > > > > Raphael
> > > > > > >
> > > > > > > [1]: https://github.com/apache/arrow-rs/issues/5770
> > > > > > > [2]: https://github.com/apache/datafusion/issues/9654
> > > > > > > [3]: https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53
> > > > > > > [4]: https://github.com/apache/arrow-rs/issues/557