Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20077 )

Change subject: IMPALA-11013: Support 'MIGRATE TABLE' for external Hive tables
......................................................................


Patch Set 6:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@9
PS5, Line 9: lega
> I think 'Hive tables' or 'legacy Hive tables' are more precise, as the tabl
Done


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@11
PS5, Line 11: an external and n
> I think 'non-transactional' is less confusing here.
I don't think that we can use 'non-transactional table' as a synonym for 
'external table'. I get that we create non-transactional tables as external 
tables but the names are for different purposes in my opinion.
Also I think we can achieve easily to make a non-transactional table 
non-external by running and ALTER TABLE query to change the EXTERNAL property.


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@13
PS5, Line 13:
> 'in a distributed filesystem or object store'
Done


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@18
PS5, Line 18:
> is
Done


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@26
PS5, Line 26: The HDFS table to be converted must follow those requirements:
            :      - table is an external table
> From Hive3+ external table == not a transactional table, so the first two p
see my comment above. I think we should still keep the differentiation between 
non-transactional and external tables.


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@30
PS5, Line 30: 
> missing verb
Done


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@32
PS5, Line 32: Hive table are re-used and not moved, copied or re-created by this
            : operation. The new Iceber
> Why don't we keep the original value? (though currently Iceberg tables with
On e reason is the bug you mentioned. But also I wanted to replicate what a 
regular CREATE TABLE STORED AS ICEBERG does where it creates the Iceberg table 
with 'external.table.purge'='true'.


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@42
PS5, Line 42: table name
> Can we use the query id instead?
Would that help for debugging? In a normal scenario the user wouldn't even see 
the tmp table name. If there is an error somewhere in the process, they get a 
hint in the error msg how to reset the table name (this msg contains the tmp 
table name too) so they would know which tmp table is created by the failed 
query.


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@46
PS5, Line 46:            data of the Hdfs table.
            :  - Step 5 (Optional): For an Iceberg table in Hadoop Tables, run a
            :            CREATE TABLE query to add the Iceberg table to HMS as 
well.
            :  - Step 6 (Optional): For an Iceberg table in Hadoop Tables, set 
the
            :            'exter
> So are these not relevant for tables in the HiveCatalog? Then probably it w
Done


http://gerrit.cloudera.org:8080/#/c/20077/5//COMMIT_MSG@56
PS5, Line 56:  - Add FE UTs
            :  - Manually tested the runtime performance for a table that is
> Could you please also try to convert tpcds.store_sales at a scale of 10?
Sure. I experimented with this and observed that regardless of the scale 
factor, the store+sales table will have around 1800 partitions and one data 
file in each partition. So from the table migration perspective there isn't 
really a difference between tpcds2 and tpcds10 runtimes. Both are around 
15-20sec on my machine with a release build.
Note, as there is one file in each partition we can't take advantage of running 
the table conversion in multiple threads.


http://gerrit.cloudera.org:8080/#/c/20077/5/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

http://gerrit.cloudera.org:8080/#/c/20077/5/be/src/service/client-request-state.cc@2240
PS5, Line 2240: y>
> nit: redundant 'the'
Done


http://gerrit.cloudera.org:8080/#/c/20077/5/be/src/service/client-request-state.cc@2240
PS5, Line 2240: hild_queries;
> Could you add a pointer to the Frontend code that sets these table properti
Done


http://gerrit.cloudera.org:8080/#/c/20077/5/be/src/service/client-request-state.cc@2269
PS5, Line 2269:     {
> I think we should try to revert the table name if the query status is not o
Instead of reverting the table I give specific hints for the user how to revert 
the name themselves. We can still enhance this code to revert the name 
programatically here, but this seemed the easier solution to get this patch 
going.


http://gerrit.cloudera.org:8080/#/c/20077/5/be/src/service/client-request-state.cc@2281
PS5, Line 2281:
> This hides the outer child_queries which can be confusing later, so maybe r
I'm not sure what you mean by outer child_queries. This function has another 
child_queries local variable but it's in a different code block. I haven't 
found any global variable with this name.


http://gerrit.cloudera.org:8080/#/c/20077/5/be/src/service/client-request-state.cc@2283
PS5, Line 2283: RuntimeProfile::Create(&profile_pool_, "C
> Isn't it always set?
It's only set for non-Hive catalogs.


http://gerrit.cloudera.org:8080/#/c/20077/5/be/src/service/client-request-state.cc@2313
PS5, Line 2313:     unique_ptr<ChildQueryExecutor> query_executor(new 
ChildQueryExecutor());
> We should notify the user in case the query status is not ok if the clean u
L2315 is responsible for notifying the user about such failures.


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/analysis/ConvertTableToIcebergStmt.java
File fe/src/main/java/org/apache/impala/analysis/ConvertTableToIcebergStmt.java:

http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/analysis/ConvertTableToIcebergStmt.java@90
PS5, Line 90:
            :     if 
(!MetaStoreUtils.isExternalTable(table.getMetaStoreTable())) {
            :       throw new AnalysisException(
            :           "CONVERT TO ICEBERG is not supported for managed 
tables");
            :     }
            :
            :     if (table.getMetaStoreTable().getParameters() != null &&
            :         
AcidUtils.isTransactionalTable(table.getMetaStoreTable().getParameters())) {
            :       throw new AnalysisException(
            :
> These will be true/false at the same time. Maybe just keep the one about tr
Not necessarily. I can run an ALTER TABLE to change 'EXTERNAL' property to 
false for a non-transactional table.


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/analysis/ConvertTableToIcebergStmt.java@115
PS5, Line 115:     }
             :
             :     if (TIcebergCatalog.HADOOP_CATALOG == 
IcebergUtil.getTIcebergCatalog(properties_)) {
             :
> The commit message mentions Hadoop Catalog.
Indeed. I meant Hadoop Tables in the commit msg.


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/analysis/ConvertTableToIcebergStmt.java@120
PS5, Line 120:
> Could you please add the link https://github.com/apache/iceberg/issues/7612
Done


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/analysis/ConvertTableToIcebergStmt.java@126
PS5, Line 126:         if (!partitionExpr.getType().isStringType()) continue;
             :         String partitionValue = partitionExpr.getStringValue();
             :         if (partitionValue == null) continue;
             :         i
> Shouldn't we check for any UTF-encoded char?
I tested this with other special chars and apparently only '/' char is 
affected. The other special chars caused issued when I was working on a 
possible workaround to pass URL-encoded partition names to Iceberg instead of 
the raw name. There the issue was that we wrote the URL-encoded values into 
metadata and stats instead of the original values. Long story short, when we 
just pass the original partition name to Iceberg, only the '/' char causes 
issues.


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/analysis/ConvertTableToIcebergStmt.java@138
PS5, Line 138:
> To ensure that the hdfs table is a pure external table, whether the table i
It is verified in L90.


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@136
PS5, Line 136:
> nit: this way it gives the impression that the above comment is related
Done


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/util/MigrateTableUtil.java
File fe/src/main/java/org/apache/impala/util/MigrateTableUtil.java:

http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/util/MigrateTableUtil.java@60
PS5, Line 60:
> nit: comment
Done


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/util/MigrateTableUtil.java@108
PS5, Line 108:     }
             :
             :     try
> Please add comment why we are dropping the table.
Done


http://gerrit.cloudera.org:8080/#/c/20077/5/fe/src/main/java/org/apache/impala/util/MigrateTableUtil.java@235
PS5, Line 235:
> This "migration" action, I think, it is IO-intensive, so the concurrency sh
I find it a bit dangerous to default to the number of CPU cores because then 
nothing else could be actively running on that machine (e.g. regular select 
queries) as they would starve each other.

What if the default of the query option is 1, where we would use the following 
logic to determine the number of threads for the migration?
- If the query option is zero, then use the number of cpu cores
- if the query option is over zero then use the query option.

The recent code change reflects the above.



--
To view, visit http://gerrit.cloudera.org:8080/20077
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacdad996d680fe545cc9a45e6bc64a348a64cd80
Gerrit-Change-Number: 20077
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 27 Jun 2023 15:21:11 +0000
Gerrit-HasComments: Yes

Reply via email to