[ 
https://issues.apache.org/jira/browse/IMPALA-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-11293:
---------------------------------------
    Description: 
Currently Impala cannot compact Iceberg tables.

The following INSERT OVERWRITE statement could be used in the simple cases, 
i.e. when the following conditions met:
 * all data files use the same partition spec (i.e. no partition evolution)
 * no bucket partitioning (we currently forbid INSERT OVERWRITE for bucket 
partitioning)

{noformat}
INSERT OVERWRITE t SELECT * FROM T;{noformat}
We could have a command that compacts the Iceberg table (syntax needs to be the 
same with Hive), e.g.:
{noformat}
ALTER TABLE t EXECUTE compaction();{noformat}
At first, the compact command could be just rewritten to the INSERT OVERWRITE 
command, but it would also check that there's no partition evolution.

The "no bucket" partitioning condition could be relaxed in this case, because 
the result would be deterministic. I.e. the only condition we need to check is 
that there was no partition evolution.

Later, we could do compaction by
{noformat}
TRUNCATE TABLE t;
INSERT INTO t SELECT * FROM t FOR SYSTEM_TIME AS OF ...;{noformat}
Currently time-travel queries are not optimized, but we could workaround it by 
doing planning at first of:
{noformat}
Create the plan for:
TRUNCATE TABLE t;
INSERT INTO t SELECT * FROM t;{noformat}
Then execute them:
{noformat}
Actually execute:
TRUNCATE TABLE t;
INSERT INTO t SELECT * FROM t; (no need for time-travel, plan created before 
TRUNCATE){noformat}
This could workaround the planning overhead of time-travel queries.

Also, we might add some locking for the table if possible.

  was:
Currently Impala cannot convert Iceberg tables.

The following INSERT OVERWRITE statement could be used in the simple cases, 
i.e. when the following conditions met:
 * all data files use the same partition spec (i.e. no partition evolution)
 * no bucket partitioning (we currently forbid INSERT OVERWRITE for bucket 
partitioning)

{noformat}
INSERT OVERWRITE t SELECT * FROM T;{noformat}
We could have a command that compacts the Iceberg table (syntax needs to be the 
same with Hive), e.g.:
{noformat}
ALTER TABLE t EXECUTE compaction();{noformat}
At first, the compact command could be just rewritten to the INSERT OVERWRITE 
command, but it would also check that there's no partition evolution.

The "no bucket" partitioning condition could be relaxed in this case, because 
the result would be deterministic. I.e. the only condition we need to check is 
that there was no partition evolution.

Later, we could do compaction by
{noformat}
TRUNCATE TABLE t;
INSERT INTO t SELECT * FROM t FOR SYSTEM_TIME AS OF ...;{noformat}
Currently time-travel queries are not optimized, but we could workaround it by 
doing planning at first of:
{noformat}
Create the plan for:
TRUNCATE TABLE t;
INSERT INTO t SELECT * FROM t;{noformat}
Then execute them:
{noformat}
Actually execute:
TRUNCATE TABLE t;
INSERT INTO t SELECT * FROM t; (no need for time-travel, plan created before 
TRUNCATE){noformat}
This could workaround the planning overhead of time-travel queries.

Also, we might add some locking for the table if possible.


> Add COMPACT command for Iceberg tables
> --------------------------------------
>
>                 Key: IMPALA-11293
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11293
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Currently Impala cannot compact Iceberg tables.
> The following INSERT OVERWRITE statement could be used in the simple cases, 
> i.e. when the following conditions met:
>  * all data files use the same partition spec (i.e. no partition evolution)
>  * no bucket partitioning (we currently forbid INSERT OVERWRITE for bucket 
> partitioning)
> {noformat}
> INSERT OVERWRITE t SELECT * FROM T;{noformat}
> We could have a command that compacts the Iceberg table (syntax needs to be 
> the same with Hive), e.g.:
> {noformat}
> ALTER TABLE t EXECUTE compaction();{noformat}
> At first, the compact command could be just rewritten to the INSERT OVERWRITE 
> command, but it would also check that there's no partition evolution.
> The "no bucket" partitioning condition could be relaxed in this case, because 
> the result would be deterministic. I.e. the only condition we need to check 
> is that there was no partition evolution.
> Later, we could do compaction by
> {noformat}
> TRUNCATE TABLE t;
> INSERT INTO t SELECT * FROM t FOR SYSTEM_TIME AS OF ...;{noformat}
> Currently time-travel queries are not optimized, but we could workaround it 
> by doing planning at first of:
> {noformat}
> Create the plan for:
> TRUNCATE TABLE t;
> INSERT INTO t SELECT * FROM t;{noformat}
> Then execute them:
> {noformat}
> Actually execute:
> TRUNCATE TABLE t;
> INSERT INTO t SELECT * FROM t; (no need for time-travel, plan created before 
> TRUNCATE){noformat}
> This could workaround the planning overhead of time-travel queries.
> Also, we might add some locking for the table if possible.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to