RussellSpitzer commented on PR #4674:
URL: https://github.com/apache/iceberg/pull/4674#issuecomment-1123959424
> @RussellSpitzer, @szehon-ho, @aokolnychyi: PR is ready for review.
>
> a) I tested with the local file system in a local test case in the IDE, with 1000
> manifest list files to read: without cache ~22.6 s, with cache ~18.5 s. Not much
> difference, probably because it is a local file system. Since the PR avoids reading
> the manifest list files again (verified with logs), a bigger difference should show
> up on an S3 file system with a huge number of snapshots.
>
> ```java
> @Test
> public void testExpireSnapshotUsingNamedArgs() {
>   sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING iceberg", tableName);
>
>   // create 1000 snapshots, one per insert
>   for (int i = 0; i < 1000; i++) {
>     sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
>   }
>
>   Table table = validationCatalog.loadTable(tableIdent);
>
>   Assert.assertEquals("Should be 1000 snapshots", 1000, Iterables.size(table.snapshots()));
>
>   waitUntilAfter(table.currentSnapshot().timestampMillis());
>
>   Timestamp currentTimestamp = Timestamp.from(Instant.ofEpochMilli(System.currentTimeMillis()));
>
>   long time = System.currentTimeMillis();
>
>   // expire all but the last 999 snapshots, toggling use_caching between runs
>   List<Object[]> output = sql(
>       "CALL %s.system.expire_snapshots(" +
>           "older_than => TIMESTAMP '%s'," +
>           "table => '%s'," +
>           "retain_last => 999, use_caching => false)",
>       catalogName, currentTimestamp, tableIdent);
>
>   System.out.println("time taken in ms: " + (System.currentTimeMillis() - time));
> }
>
> time taken in ms: 18478 -- with cache (use_caching => true)
> time taken in ms: 22589 -- without cache (use_caching => false)
> ```
>
> **b) The base code already uses caching in `RewriteManifestsProcedure`, so I used
> the same pattern and syntax (keeping caching enabled by default).** I also found a
> bug in the base code and raised PR #4722.
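>
> As a rough illustration, that pattern boils down to conditionally persisting the
> intermediate file listing. This is a minimal sketch; the helper and class names
> here are hypothetical, not the actual Iceberg internals:
>
> ```java
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.storage.StorageLevel;
>
> class CachingSketch {
>   // Persist the intermediate file listing once so later actions reuse it
>   // instead of re-reading the manifest list files (hypothetical helper
>   // mirroring a use_caching flag).
>   static Dataset<Row> maybeCache(Dataset<Row> files, boolean useCaching) {
>     return useCaching ? files.persist(StorageLevel.MEMORY_AND_DISK()) : files;
>   }
> }
> ```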
>
> c) I have some more plans for improving `expire_snapshots` (a different idea than
> #3457); I will work on it in a subsequent PR.
>
> The idea is to add a new "from_snapshot_id" column while preparing the actual
> files, then filter out the rows for the expired snapshot ids (a NOT IN filter)
> from the persisted output (without scanning again), and then apply the same
> `df.except()` logic to find the expired files.
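>
> A minimal sketch of that idea (the column, class, and variable names are
> illustrative assumptions, not the actual implementation):
>
> ```java
> import static org.apache.spark.sql.functions.col;
> import static org.apache.spark.sql.functions.not;
>
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
>
> class ExpireByDiffSketch {
>   // allFiles: every file row tagged with the snapshot that added it,
>   // persisted once so expiry does not rescan the manifest lists.
>   static Dataset<Row> expiredFiles(Dataset<Row> allFiles, Object[] expiredSnapshotIds) {
>     // Valid files: drop rows whose from_snapshot_id was expired (the NOT IN filter).
>     Dataset<Row> validFiles =
>         allFiles.filter(not(col("from_snapshot_id").isin(expiredSnapshotIds)));
>     // Expired files: rows no longer reachable from any retained snapshot.
>     return allFiles.except(validFiles);
>   }
> }
> ```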

The benchmarks really need to be on datasets that take on the order of minutes to
run if we want to see the difference. The key would be to set up a test with
thousands or tens of thousands of manifests.