qphien commented on issue #2093: URL: https://github.com/apache/iceberg/issues/2093#issuecomment-762191244
> You might be able to do this via the various metadata tables, though it would be somewhat complex. https://iceberg.apache.org/spark/#inspecting-tables
>
> It looks like you could achieve this by joining a table's `manifest` metadata table, which has a `partitions` column indicating what partition columns have been affected, with the table's `snapshots` table and `history` metadata table.
>
> There are some examples of joining the two, but essentially you'd want to explode the table's snapshot metadata table on the `manifest_list` column so that you get one row in the expanded snapshots table for each updated / created manifest. That manifest path can be joined with the `path` column in the `manifest` metadata table to then get all of the partitions that are involved in that snapshot. You can find when exactly that snapshot was made current by joining on the `made_current_at` field from the metadata `history` table.

Thanks @kbendick for your reply.

Yes, we can join `manifest` with `snapshot` and `history` to get the partition create/update time, but this join is inefficient when there are a large number of snapshots, because we have to scan all snapshots and manifests.

Could we add an additional `create-time` field to `manifest.data_file`? Then only the latest snapshot and its related manifests would need to be scanned.
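
To make the cost difference concrete, here is a toy in-memory sketch in Python. It is only an illustration under stated assumptions: the `DataFile`, `Manifest`, and `Snapshot` classes are hypothetical stand-ins and not Iceberg APIs, `create_time` is the field proposed in this comment and does not exist today, and timestamps are plain integers. It counts how many manifest rows each approach has to touch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    partition: str
    create_time: int  # hypothetical field proposed in this comment


@dataclass(frozen=True)
class Manifest:
    path: str
    data_files: tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int
    manifests: tuple  # manifest list: every manifest reachable from this snapshot


def _merge(times, partition, ts):
    """Track (earliest, latest) timestamp seen for a partition."""
    first, last = times.get(partition, (ts, ts))
    times[partition] = (min(first, ts), max(last, ts))


def times_by_join(snapshots):
    """Join-style approach: explode every snapshot's manifest list."""
    first_seen, by_path, rows = {}, {}, 0
    for snap in snapshots:
        for manifest in snap.manifests:
            rows += 1  # one exploded (snapshot, manifest) row
            by_path[manifest.path] = manifest
            prev = first_seen.get(manifest.path, snap.committed_at)
            first_seen[manifest.path] = min(prev, snap.committed_at)
    times = {}
    for path, ts in first_seen.items():
        for f in by_path[path].data_files:
            _merge(times, f.partition, ts)  # ignores create_time on purpose
    return times, rows


def times_by_create_time(latest):
    """Proposed approach: read only the latest snapshot's manifests."""
    times, rows = {}, 0
    for manifest in latest.manifests:
        rows += 1
        for f in manifest.data_files:
            _merge(times, f.partition, f.create_time)
    return times, rows


# Toy history: three appends, each writing one new manifest.
m1 = Manifest("m1.avro", (DataFile("day=1", 100),))
m2 = Manifest("m2.avro", (DataFile("day=1", 200), DataFile("day=2", 200)))
m3 = Manifest("m3.avro", (DataFile("day=3", 300),))
history = [
    Snapshot(1, 100, (m1,)),
    Snapshot(2, 200, (m1, m2)),
    Snapshot(3, 300, (m1, m2, m3)),
]

join_times, join_rows = times_by_join(history)
direct_times, direct_rows = times_by_create_time(history[-1])
```

In this toy history both approaches produce the same per-partition (create, update) times, but the join touches 6 exploded snapshot/manifest rows while the `create_time` scan reads only the 3 manifests of the latest snapshot. Within this simplified model, the gap widens as more snapshots accumulate.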
