>
> So, if I understand correctly, you have a small number of key-value
> metadata entries, but the values may be large?


Values are likely proportional to the number of schema elements (but there
are only a few entries).



> Also, you actually need those metadata values to do anything with the
> data (because they tell you the actual Iceberg schema), so on-demand
> decoding of these values would probably not help for you?


No, this is mostly for debugging.  Schema history should always be stored
at a higher level, and in theory the only thing needed for reads is the
field_ids, which are stored as part of the schema.
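
For anyone who wants to answer the size question for their own files, here
is a minimal sketch of how the entries could be counted and measured. It
assumes the key-value metadata is already available as a bytes-to-bytes
mapping (with pyarrow, I believe pyarrow.parquet.read_metadata(path).metadata
returns one). The function name and the sample keys and values below are
made up for illustration and are not taken from a real Iceberg file.

```python
def summarize_key_value_metadata(kv):
    """Return (entry_count, total_bytes, per_key_sizes) for a bytes->bytes mapping.

    The size of an entry is counted as len(key) + len(value); Thrift framing
    overhead is deliberately ignored in this sketch.
    """
    sizes = {key: len(key) + len(value) for key, value in kv.items()}
    return len(sizes), sum(sizes.values()), sizes


# Hypothetical sample data standing in for a file's key-value metadata.
sample = {
    b"iceberg.schema": b'{"type":"struct","fields":[]}',
    b"delete-type": b"equality",
}

count, total, sizes = summarize_key_value_metadata(sample)

# Print the largest entries first, since a serialized schema would dominate.
for key, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
    print(key.decode("utf-8"), size)
print("entries:", count, "total bytes:", total)
```

Sorting by size makes it easy to see whether a single value (such as a
serialized schema) accounts for most of the bytes.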



On Friday, May 17, 2024, Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Fokko,
>
> So, if I understand correctly, you have a small number of key-value
> metadata entries, but the values may be large?
>
> Also, you actually need those metadata values to do anything with the
> data (because they tell you the actual Iceberg schema), so on-demand
> decoding of these values would probably not help for you?
>
> (I'm not sure large string values are a problem with Thrift; I would
> hope not)
>
> Regards
>
> Antoine.
>
>
> On Thu, 16 May 2024 22:45:02 +0200
> Fokko Driesprong <fo...@apache.org> wrote:
> > Hey Antoine,
> >
> > First of all, love the recent uptake in activity on the Parquet side. I'm
> > on holiday, but I'll definitely catch up when I return.
> >
> > I wanted to respond to this particular mail since we do store various
> > fields in the metadata for Apache Iceberg. For example:
> >
> >    - The JSON serialized Iceberg schema that was used when writing the
> >    file:
> >    https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L274
> >    - In the case of delete files, we write the kind of file (positional or
> >    equality), and in the case of equality, also the field IDs:
> >    https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L905-L910
> >
> > This is mostly for debugging purposes. The schema could become quite big
> > as it is proportional to the number of columns. The metadata is mostly set
> > for debugging purposes and is not part of the official Iceberg spec.
> >
> > I hope this helps!
> >
> > Kind regards,
> > Fokko
> >
> > Op do 16 mei 2024 om 21:17 schreef Antoine Pitrou <anto...@python.org>:
> >
> > >
> > > Hello,
> > >
> > > In https://github.com/apache/parquet-format/pull/242 the question came
> > > up of the size and overhead of key-value metadata entries in real world
> > > Parquet files (basically, user-defined metadata attached either to the
> > > entire file or to individual columns). Do people have insight to share
> > > about the typical number of metadata entries in a file or column, and
> > > their typical byte size?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
>
>
>