[ https://issues.apache.org/jira/browse/IMPALA-14014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959605#comment-17959605 ]
ASF subversion and git services commented on IMPALA-14014:
----------------------------------------------------------
Commit 4640a72a81e49c45fb7629950befdd89edf82381 in impala's branch
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=4640a72a8 ]
IMPALA-14014: Fix COMPUTE STATS with TABLESAMPLE clause
COMPUTE STATS with a TABLESAMPLE clause has done a full scan on Iceberg
tables since IMPALA-13737. Before this patch, ComputeStatsStmt used
FeFsTable.Utils.getFilesSample(), which only works correctly on FS
tables that have their file descriptors loaded. Since IMPALA-13737,
the internal FS table of an Iceberg table has no file descriptor
information, so FeFsTable.Utils.getFilesSample() returned an empty
map, which turned off table sampling for COMPUTE STATS.
We did not have proper testing for COMPUTE STATS with table sampling,
so we did not catch the regression.
This patch adds proper table sampling logic for Iceberg tables that
can be used for COMPUTE STATS. The algorithm previously found in
IcebergScanNode.getFilesSample() has been moved to
FeIcebergTable.Utils.getFilesSample().
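For illustration only, file-level sampling of this kind can be sketched
as below. This is not the code from the patch: the class
FileSampleSketch, the DataFile type, and the sampleFiles() signature are
hypothetical names made up for this example. It assumes the sample is
built by picking files at random until their combined size reaches a
byte budget of max(totalBytes * percent / 100, minSampleBytes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of byte-budgeted random file sampling.
// None of these names come from the Impala code base.
class FileSampleSketch {
  /** Stand-in for a data file entry: a path and its length in bytes. */
  static final class DataFile {
    final String path;
    final long lengthBytes;

    DataFile(String path, long lengthBytes) {
      this.path = path;
      this.lengthBytes = lengthBytes;
    }
  }

  /**
   * Randomly picks files until the selected bytes reach
   * max(totalBytes * percentBytes / 100, minSampleBytes).
   * Returns an empty list when 'files' is empty.
   */
  static List<DataFile> sampleFiles(List<DataFile> files, long percentBytes,
      long minSampleBytes, long randomSeed) {
    if (percentBytes < 0 || percentBytes > 100) {
      throw new IllegalArgumentException("percentBytes must be in [0, 100]");
    }
    long totalBytes = 0;
    for (DataFile f : files) totalBytes += f.lengthBytes;
    long targetBytes = Math.max(totalBytes * percentBytes / 100, minSampleBytes);

    List<DataFile> remaining = new ArrayList<>(files);
    List<DataFile> sample = new ArrayList<>();
    Random rnd = new Random(randomSeed);  // fixed seed gives a repeatable sample
    long selectedBytes = 0;
    while (selectedBytes < targetBytes && !remaining.isEmpty()) {
      int idx = rnd.nextInt(remaining.size());
      DataFile picked = remaining.get(idx);
      // Move the last element into the picked slot, then drop the tail,
      // so each removal is O(1).
      int last = remaining.size() - 1;
      remaining.set(idx, remaining.get(last));
      remaining.remove(last);
      sample.add(picked);
      selectedBytes += picked.lengthBytes;
    }
    return sample;
  }

  public static void main(String[] args) {
    List<DataFile> files = new ArrayList<>();
    for (int i = 0; i < 100; i++) files.add(new DataFile("file-" + i, 1000));
    List<DataFile> sample = sampleFiles(files, 10, 0, 42L);
    System.out.println("Sampled " + sample.size() + " of " + files.size() + " files");
  }
}
```

In this sketch an empty file list yields an empty sample, so a caller
that cannot see any file descriptors ends up with nothing to sample
from, which is the kind of situation the regression above describes.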
Testing
* added e2e tests
Change-Id: Ie59d5fc1374ab69209a74f2488bcb9a7d510b782
Reviewed-on: http://gerrit.cloudera.org:8080/22873
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> TABLESAMPLE of COMPUTE STATS has no effect on Iceberg tables
> ------------------------------------------------------------
>
> Key: IMPALA-14014
> URL: https://issues.apache.org/jira/browse/IMPALA-14014
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Critical
> Labels: impala-iceberg
>
> TABLESAMPLE of COMPUTE STATS has no effect on Iceberg tables.
> E.g. COMPUTE STATS t TABLESAMPLE SYSTEM(10); scans the whole table to
> calculate statistics, whereas it should only use ~10% of the data.