Noemi Pap-Takacs has uploaded a new patch set (#13). ( http://gerrit.cloudera.org:8080/21388 )

Change subject: IMPALA-12867: Filter files to OPTIMIZE based on file size
......................................................................

IMPALA-12867: Filter files to OPTIMIZE based on file size

The OPTIMIZE TABLE statement is currently used to rewrite the entire
Iceberg table. With the 'FILE_SIZE_THRESHOLD_MB' option, the user can
specify a file size limit to rewrite only small files.

Syntax: OPTIMIZE TABLE <table_name> [(FILE_SIZE_THRESHOLD_MB=<value>)];
The threshold value is a file size in MB and must be a non-negative
integer. Data files larger than the given limit are rewritten only if
they are referenced from delete files.
If only 1 file is selected in a partition, it is not rewritten.
If the threshold is 0, only the delete files and the data files they
reference are rewritten.
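
The selection rules above can be illustrated with a small Python sketch.
This is not the code in IcebergOptimizeFileFilter.java; the DataFile
record and the select_files() helper are hypothetical names made up for
illustration. It assumes that 1 MB is 2**20 bytes, that "small" means
strictly below the limit, and that the single-file rule applies only to
small data files without deletes. None of these details are stated in
the commit message.

```python
from collections import defaultdict
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for an Iceberg data file."""
    path: str
    partition: str
    size_bytes: int
    has_deletes: bool  # True if a delete file references this data file


def select_files(files, threshold_mb):
    """Return the paths of the data files that would be rewritten."""
    if not isinstance(threshold_mb, int) or threshold_mb < 0:
        raise ValueError("FILE_SIZE_THRESHOLD_MB must be a non-negative integer")
    limit = threshold_mb * MB

    selected = set()
    small_by_partition = defaultdict(list)
    for f in files:
        if f.has_deletes:
            # Data files referenced from delete files are rewritten
            # regardless of their size.
            selected.add(f.path)
        elif f.size_bytes < limit:  # assumption: strict comparison
            small_by_partition[f.partition].append(f.path)

    # Assumption: a lone small file in a partition is left alone, since
    # rewriting a single file would not reduce the file count.
    for paths in small_by_partition.values():
        if len(paths) > 1:
            selected.update(paths)
    return selected


files = [
    DataFile("a", "p1", 10 * MB, False),
    DataFile("b", "p1", 20 * MB, False),
    DataFile("c", "p1", 500 * MB, False),
    DataFile("d", "p2", 5 * MB, False),
    DataFile("e", "p2", 900 * MB, True),
]
print(sorted(select_files(files, 100)))
print(sorted(select_files(files, 0)))
```

With a threshold of 100 the sketch picks the two small files in p1 and
the large file with deletes in p2, and skips the lone small file in p2.
With a threshold of 0 only the file referenced by deletes is picked.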

IMPALA-12839: 'Optimizing empty table should be no-op' is also
resolved in this patch.

With the file selection option, the OPTIMIZE operation can operate
in 3 different modes:
- REWRITE_ALL: The entire table is rewritten, either because the
  compaction was triggered by a simple 'OPTIMIZE TABLE' command
  without a 'FILE_SIZE_THRESHOLD_MB' parameter, or because every file
  in the table is a delete file, is referenced by deletes, or is
  smaller than the limit.
- PARTIAL: If 'FILE_SIZE_THRESHOLD_MB' is specified, only the small
  data files without deletes are selected and the delete files are
  merged. Large data files without deletes are kept to avoid
  unnecessary, resource-consuming writes.
- NOOP: When no files meet the selection criteria, nothing needs to be
  rewritten and the operation does nothing.
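
As a rough illustration of how the three modes relate to the outcome of
file selection, here is a minimal Python sketch. The Mode enum values
mirror the names above, but choose_mode() and its parameters are
hypothetical and are not taken from the patch. Mapping an empty table to
NOOP is an assumption based on the IMPALA-12839 note.

```python
from enum import Enum


class Mode(Enum):
    REWRITE_ALL = "REWRITE_ALL"
    PARTIAL = "PARTIAL"
    NOOP = "NOOP"


def choose_mode(total_files, selected_files, threshold_given):
    """Classify an OPTIMIZE run from file counts (illustrative only).

    total_files:     number of files in the table
    selected_files:  number of files picked by the size/delete filter
    threshold_given: whether FILE_SIZE_THRESHOLD_MB was specified
    """
    if total_files == 0:
        # Assumption: optimizing an empty table is treated as a no-op.
        return Mode.NOOP
    if not threshold_given:
        # Plain OPTIMIZE TABLE rewrites everything.
        return Mode.REWRITE_ALL
    if selected_files == 0:
        return Mode.NOOP
    if selected_files == total_files:
        # Every file qualified, so this equals a full rewrite.
        return Mode.REWRITE_ALL
    return Mode.PARTIAL


print(choose_mode(10, 0, threshold_given=False))
print(choose_mode(10, 4, threshold_given=True))
print(choose_mode(10, 0, threshold_given=True))
```

Treating "everything selected" as REWRITE_ALL rather than PARTIAL
follows the mode description above, where a full rewrite can also result
from a threshold that happens to match every file.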

Testing:
 - Parser test
 - FE unit tests
 - E2E tests

Change-Id: Icfbb589513aacdb68a86c1aec4a0d39b12091820
---
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/thrift/CatalogService.thrift
M common/thrift/Query.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java
M fe/src/main/java/org/apache/impala/analysis/TableRef.java
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
A fe/src/main/java/org/apache/impala/util/IcebergOptimizeFileFilter.java
M fe/src/main/jflex/sql-scanner.flex
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
A fe/src/test/java/org/apache/impala/util/IcebergFileFilterTest.java
M testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-optimize.test
M tests/query_test/test_iceberg.py
20 files changed, 788 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/88/21388/13
--
To view, visit http://gerrit.cloudera.org:8080/21388
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfbb589513aacdb68a86c1aec4a0d39b12091820
Gerrit-Change-Number: 21388
Gerrit-PatchSet: 13
Gerrit-Owner: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
