RussellSpitzer commented on PR #4674:
URL: https://github.com/apache/iceberg/pull/4674#issuecomment-1123959424
> @RussellSpitzer, @szehon-ho, @aokolnychyi: PR is ready for review.
>
> a) I tested with the local file system in a local test case in the IDE, with 1000
> manifest list files to read: without cache ~22.6 s, with cache ~18.5 s. Not much
> difference, probably because it is a local file system. Since the PR avoids reading
> the manifest list files again (verified with logs), a bigger difference should show
> up on an S3 file system with a huge number of snapshots.
>
> ```java
> @Test
> public void testExpireSnapshotUsingNamedArgs() {
>   sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING iceberg", tableName);
>
>   // create 1000 snapshots, one per insert
>   for (int i = 0; i < 1000; i++) {
>     sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
>   }
>
>   Table table = validationCatalog.loadTable(tableIdent);
>
>   Assert.assertEquals("Should be 1000 snapshots", 1000, Iterables.size(table.snapshots()));
>
>   waitUntilAfter(table.currentSnapshot().timestampMillis());
>
>   Timestamp currentTimestamp = Timestamp.from(Instant.ofEpochMilli(System.currentTimeMillis()));
>
>   long time = System.currentTimeMillis();
>
>   // expire all but the last 999 snapshots, toggling use_caching between runs
>   List<Object[]> output = sql(
>       "CALL %s.system.expire_snapshots(" +
>           "older_than => TIMESTAMP '%s'," +
>           "table => '%s'," +
>           "retain_last => 999, use_caching => false)",
>       catalogName, currentTimestamp, tableIdent);
>
>   System.out.println("time taken in ms: " + (System.currentTimeMillis() - time));
> }
>
> time taken in ms: 18478 -- with cache (use_caching => true)
> time taken in ms: 22589 -- without cache (use_caching => false)
> ```
>
> **b) The base code already uses caching in `RewriteManifestsProcedure`, so I used
> the same pattern and syntax (keeping caching enabled by default).** I also found a
> bug in the base code and raised PR #4722.
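>
> As a rough illustration, that pattern boils down to conditionally persisting the
> intermediate file listing. This is a minimal sketch; the helper and class names
> here are hypothetical, not the actual Iceberg internals:
>
> ```java
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.storage.StorageLevel;
>
> class CachingSketch {
>   // Persist the intermediate file listing once so later actions reuse it
>   // instead of re-reading the manifest list files (hypothetical helper
>   // mirroring a use_caching flag).
>   static Dataset<Row> maybeCache(Dataset<Row> files, boolean useCaching) {
>     return useCaching ? files.persist(StorageLevel.MEMORY_AND_DISK()) : files;
>   }
> }
> ```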
>
> c) I have some more plans for improving `expire_snapshots` (a different idea than
> #3457); I will work on it in a subsequent PR.
>
> The idea is to add a new "from_snapshot_id" column while preparing the actual
> files, then filter out the rows for the expired snapshot ids (a NOT IN filter)
> from the persisted output (without scanning again), and then apply the same
> `df.except()` logic to find the expired files.
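>
> A minimal sketch of that idea (the column, class, and variable names are
> illustrative assumptions, not the actual implementation):
>
> ```java
> import static org.apache.spark.sql.functions.col;
> import static org.apache.spark.sql.functions.not;
>
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
>
> class ExpireByDiffSketch {
>   // allFiles: every file row tagged with the snapshot that added it,
>   // persisted once so expiry does not rescan the manifest lists.
>   static Dataset<Row> expiredFiles(Dataset<Row> allFiles, Object[] expiredSnapshotIds) {
>     // Valid files: drop rows whose from_snapshot_id was expired (the NOT IN filter).
>     Dataset<Row> validFiles =
>         allFiles.filter(not(col("from_snapshot_id").isin(expiredSnapshotIds)));
>     // Expired files: rows no longer reachable from any retained snapshot.
>     return allFiles.except(validFiles);
>   }
> }
> ```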

The benchmarks really need to be on datasets that take on the order of minutes to
run if we want to see the difference. The key would be to set up a test with
thousands or tens of thousands of manifests.