rdblue opened a new pull request #805: Add all_data_files, all_manifests, and 
all_entries metadata tables
URL: https://github.com/apache/incubator-iceberg/pull/805
 
 
   This adds three new metadata tables, with tests:
   * `all_data_files` lists all data files in a table that are accessible from 
any valid (not expired) snapshot
   * `all_entries` lists all manifest entries in a table that are accessible 
from any valid snapshot
   * `all_manifests` lists all manifest files in a table that are accessible 
from any valid snapshot
   
   These tables may contain duplicate rows, because a file or manifest that is 
reachable from more than one snapshot is listed once for each. Deduplication 
can't be done through the current scan interface unless all of the work is done 
during scan planning on a single node. Duplicates are the trade-off for being 
able to process the metadata in parallel for large tables.
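
Consumers that need unique rows can deduplicate after reading. A minimal sketch, 
assuming rows have already been collected as dictionaries and that `file_path` 
identifies a data file; the helper name `dedupe_rows` is hypothetical and not 
part of this PR:

```python
def dedupe_rows(rows, key="file_path"):
    """Return rows with duplicates removed, keeping the first row seen per key.

    Assumption: each row is a dict and `key` names the column that identifies
    a file (for example `file_path` in rows read from `all_data_files`).
    """
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            unique.append(row)
    return unique
```

In a distributed engine the same effect would more likely come from a `DISTINCT` 
or a `GROUP BY` on the identifying column, so the work stays parallel.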
   
   ### Use cases
   
   We recently added the `all_data_files` and `all_manifests` tables to enable 
building services that manage data files. For example, a janitor service that 
cleans up orphaned or dangling data files needs to be able to list all valid 
files in a table. Along with the `snapshots` table that has manifest list 
locations, `all_manifests` and `all_data_files` enable listing all data and 
metadata files referenced by a table.
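
The core check of such a janitor can be sketched as a set difference. This is an 
illustration only: the function name `find_orphans` and its inputs are 
hypothetical, and it assumes the caller has already listed the table location 
and read the manifest list locations from `snapshots`, the manifest paths from 
`all_manifests`, and the data file paths from `all_data_files`:

```python
def find_orphans(listed_paths, manifest_lists, manifests, data_files):
    """Return listed paths that no valid snapshot references, sorted.

    All arguments are iterables of path strings (assumed to use the same
    URI form). Duplicates in the inputs are harmless because sets are used.
    """
    referenced = set(manifest_lists) | set(manifests) | set(data_files)
    return sorted(set(listed_paths) - referenced)
```

A real service would probably also skip recently written files, since files from 
in-flight commits are not yet referenced by any snapshot.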
   
   We use the `all_entries` table to detect the last modified time of 
partitions. This requires knowing when a file was appended or overwritten, so 
the query keeps only added entries (`status = 1`) from `append` and `overwrite` 
snapshots and ignores later rewrites:
   
   ```sql
   SELECT
       max(s.committed_at) as last_updated_at,
       e.data_file.partition.*
   FROM db.table.all_entries e
   JOIN db.table.snapshots s
     ON e.snapshot_id = s.snapshot_id
   WHERE e.status = 1 AND s.operation IN ('append', 'overwrite')
   GROUP BY e.data_file.partition
   ```
