Hi Everyone,

I am writing to let you all know about the proposal under discussion at
https://github.com/apache/iceberg/issues/13855
regarding the lack of "columns written" information for each file, and to
continue the discussion here so that we can reach a conclusion.

Problem Statement & Background:

I initially proposed linking a schema id to each file so that the columns
written could be derived from that link. It turns out, however, that this
link alone does not tell us which columns were written, because columns can
be optional. Ideally, we want to know every column written in each file, so
that during evaluation we can tell whether one or more specific columns are
present in that file and decide whether to skip it.

A "Columns written" metric could be added to hold this information. It
would contain all field ids used in the file. The decision whether to skip
a file could then be made by looking up the field ids referenced in the
Expression in the "Columns written" set.
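
To make the lookup concrete, here is a minimal sketch of the idea in
Python. It is not Iceberg code: the function name and arguments are
hypothetical, and it assumes the filter is a conjunction (AND) of
predicates that each need a non-null value to match, and that a column
missing from a file reads as null (no default value). Predicates such as
IS NULL, or filters combined with OR, would need different handling.

```python
from typing import FrozenSet, Iterable


def can_skip_file(referenced_field_ids: Iterable[int],
                  columns_written: FrozenSet[int]) -> bool:
    """Hypothetical helper, not an Iceberg API.

    Assumes every referenced field belongs to a predicate that can only
    match non-null values, and that the predicates are ANDed together.
    Under those assumptions, a file that did not write one of the
    referenced fields cannot contain a matching row.
    """
    return any(field_id not in columns_written
               for field_id in referenced_field_ids)


# Example: the file wrote field ids 1, 2 and 4 only.
written = frozenset({1, 2, 4})
skip_a = can_skip_file([3], written)     # field 3 was never written
skip_b = can_skip_file([1, 2], written)  # both fields were written
skip_c = can_skip_file([1, 3], written)  # one ANDed field is missing
```

The set lookup is constant time per field id, so the check stays cheap
even for wide schemas.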

Please share your thoughts on this proposal.

Thanks,
Mani
