> So, if I understand correctly, you have a small number of key-value
> metadata entries, but the values may be large?

Values are likely proportional to the number of schema elements (but there
are only a few of them).

> Also, you actually need those metadata values to do anything with the
> data (because they tell you the actual Iceberg schema), so on-demand
> decoding of these values would probably not help for you?

No, this is mostly for debugging. Schema history should always be stored at
a higher level, and in theory the only thing needed is the field_ids, which
are stored as part of the schema for reads.

On Friday, May 17, 2024, Antoine Pitrou <anto...@python.org> wrote:

> Hi Fokko,
>
> So, if I understand correctly, you have a small number of key-value
> metadata entries, but the values may be large?
>
> Also, you actually need those metadata values to do anything with the
> data (because they tell you the actual Iceberg schema), so on-demand
> decoding of these values would probably not help for you?
>
> (I'm not sure large string values are a problem with Thrift; I would
> hope not)
>
> Regards
>
> Antoine.
>
>
> On Thu, 16 May 2024 22:45:02 +0200
> Fokko Driesprong <fo...@apache.org> wrote:
> > Hey Antoine,
> >
> > First of all, love the recent uptake in activity on the Parquet side. I'm
> > on holiday, but I'll definitely catch up when I return.
> >
> > I wanted to respond to this particular mail since we do store various
> > fields in the metadata for Apache Iceberg. For example:
> >
> > - The JSON serialized Iceberg schema that was used when writing the
> >   file:
> >   https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L274
> > - In the case of delete files, we write the kind of file (positional or
> >   equality), and in the case of equality, also the field IDs:
> >   https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L905-L910
> >
> > This is mostly for debugging purposes. The schema could become quite big
> > as it is proportional to the number of columns. The metadata is mostly
> > set for debugging purposes and is not part of the official Iceberg spec.
> >
> > I hope this helps!
> >
> > Kind regards,
> > Fokko
> >
> > On Thu, 16 May 2024 at 21:17, Antoine Pitrou <anto...@python.org> wrote:
> >
> > > Hello,
> > >
> > > In https://github.com/apache/parquet-format/pull/242 the question came
> > > up of the size and overhead of key-value metadata entries in real world
> > > Parquet files (basically, user-defined metadata attached either to the
> > > entire file or to individual columns). Do people have insight to share
> > > about the typical number of metadata entries in a file or column, and
> > > their typical byte size?
> > >
> > > Regards
> > >
> > > Antoine.
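
For anyone who wants to put rough numbers on the question above (how many key-value entries, and how many bytes), here is a minimal Python sketch. It only works on a plain bytes-to-bytes mapping; with pyarrow, `pyarrow.parquet.read_metadata(path).metadata` should give such a mapping for file-level metadata, but that call is not exercised here. The helper names, the `example.*` keys, and the synthetic schema are all made up for illustration and are not what Iceberg actually writes.

```python
import json


def fake_schema_json(num_columns):
    """Build a JSON document loosely shaped like a struct schema.

    Purely synthetic: it only serves to show how the serialized size
    grows with the number of columns.
    """
    fields = [
        {"id": i, "name": f"col_{i}", "required": False, "type": "long"}
        for i in range(1, num_columns + 1)
    ]
    schema = {"type": "struct", "schema-id": 0, "fields": fields}
    return json.dumps(schema).encode("utf-8")


def summarize_key_value_metadata(metadata):
    """Summarize a bytes->bytes mapping of key-value metadata entries."""
    # Size of an entry here is key bytes plus value bytes; this ignores
    # any per-entry framing overhead added by the file format itself.
    sizes = {key: len(key) + len(value) for key, value in metadata.items()}
    largest_key = max(sizes, key=sizes.get) if sizes else None
    return {
        "num_entries": len(sizes),
        "total_bytes": sum(sizes.values()),
        "largest_key": largest_key,
        "largest_bytes": sizes.get(largest_key, 0),
    }


if __name__ == "__main__":
    for n in (10, 100, 1000):
        metadata = {
            b"example.schema": fake_schema_json(n),  # hypothetical key
            b"example.writer": b"demo",              # hypothetical key
        }
        print(n, summarize_key_value_metadata(metadata))
```

The entry count stays fixed while the total size is dominated by the one schema value, which matches the "few entries, potentially large values" description in the thread.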