ajantha-bhat commented on PR #4674:
URL: https://github.com/apache/iceberg/pull/4674#issuecomment-1120406659

   @RussellSpitzer, @szehon-ho, @aokolnychyi:  PR is ready for review. 
   
   a) I tested with the local file system, in a local test case in the IDE, with 1000 manifest list files to read: ~22.6 s without the cache versus ~18.5 s with it.
   The difference is not large, probably because it is a local file system. Since the PR avoids re-reading the manifest list files (verified with logs), a much bigger difference can probably be seen on a file system like S3 with a huge number of snapshots.
   
   ```
   @Test
   public void testExpireSnapshotUsingNamedArgs() {
     sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING iceberg", tableName);

     for (int i = 0; i < 1000; i++) {
       sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
     }

     Table table = validationCatalog.loadTable(tableIdent);

     Assert.assertEquals("Should be 1000 snapshots", 1000, Iterables.size(table.snapshots()));

     waitUntilAfter(table.currentSnapshot().timestampMillis());

     Timestamp currentTimestamp = Timestamp.from(Instant.ofEpochMilli(System.currentTimeMillis()));

     long time = System.currentTimeMillis();

     // Toggle use_caching between true and false to compare the two runs.
     List<Object[]> output = sql(
         "CALL %s.system.expire_snapshots(" +
             "older_than => TIMESTAMP '%s'," +
             "table => '%s'," +
             "retain_last => 999, use_caching => false)",
         catalogName, currentTimestamp, tableIdent);

     System.out.println("time taken in ms: " + (System.currentTimeMillis() - time));
   }

   time taken in ms: 18478  -- with cache (use_caching => true)
   time taken in ms: 22589  -- without cache (use_caching => false)
   ```
     
   **b) The base code already uses caching for `RewriteManifestsProcedure`, so I followed the same pattern and syntax (keeping the cache enabled by default).**
   I also found a bug in the base code and raised PR #4722.
   
   c) I have some more plans for improving `expire_snapshots` (a different idea than https://github.com/apache/iceberg/pull/3457); I will work on it in a subsequent PR.
   
   The idea is to add a new column, "from_snapshot_id", while preparing the listing of actual files, then filter out the rows belonging to expired snapshot ids from the persisted output with a NOT IN filter (without scanning again), and finally apply the same df.except() logic to find the expired files.
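   
   The filtering idea in (c) can be sketched with plain Java collections. This is a simplified illustration, not the Spark implementation: the class `ExpireSketch`, the methods `liveFiles`/`expiredFiles`, and the in-memory map (file path mapped to the hypothetical `from_snapshot_id` value) are all made up for the example, and it ignores that a real data file can be referenced by several snapshots.
   
   ```java
   import java.util.*;
   import java.util.stream.*;
   
   public class ExpireSketch {
     // Each map entry models one row of the persisted file listing:
     // file path -> from_snapshot_id (the snapshot that produced the file).
     static Set<String> liveFiles(Map<String, Long> allFiles, Set<Long> expiredSnapshotIds) {
       // The "NOT IN" filter: drop rows whose from_snapshot_id is expired,
       // instead of re-scanning the table metadata.
       return allFiles.entrySet().stream()
           .filter(e -> !expiredSnapshotIds.contains(e.getValue()))
           .map(Map.Entry::getKey)
           .collect(Collectors.toSet());
     }
   
     static Set<String> expiredFiles(Map<String, Long> allFiles, Set<Long> expiredSnapshotIds) {
       // Analogue of df.except(): files in the full listing but not in the live set.
       Set<String> result = new HashSet<>(allFiles.keySet());
       result.removeAll(liveFiles(allFiles, expiredSnapshotIds));
       return result;
     }
   
     public static void main(String[] args) {
       Map<String, Long> allFiles = Map.of(
           "data/f1.parquet", 1L,
           "data/f2.parquet", 2L,
           "data/f3.parquet", 3L);
       // Expiring snapshot 1 leaves f1 as the only candidate for deletion.
       System.out.println(expiredFiles(allFiles, Set.of(1L)));
     }
   }
   ```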


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
