[
https://issues.apache.org/jira/browse/IMPALA-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy updated IMPALA-11339:
---------------------------------------
Description:
Currently Impala doesn't support LOAD DATA statements for Iceberg tables.
Some user workflows still use this statement, so Impala should support it
in some form.
The parameter to LOAD DATA can be a directory or a single file.
A possible solution would be to
# Create an external table
## If the parameter is a single file, then we can use IMPALA-10934 to define
an external table on this single file
## If the parameter is a directory, then we need to create an external table
using the directory as table location. To get the table schema we could use
CREATE TABLE LIKE PARQUET/ORC
# Run an {{insert into iceberg_table select * from tmp_table}}
# Drop the tmp table (it is still undecided whether to keep or remove the
original files)
This approach copies the data, but it is probably the safest solution.
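The directory case of the steps above can be sketched as a small statement generator. This is only an illustration, not the planned implementation: the function name, the tmp table name, and the paths are hypothetical, it assumes the files are Parquet, and it does not cover the single-file case (IMPALA-10934) or the PARTITION clause.

```python
def load_data_statements(src_dir, schema_file, target_table,
                         tmp_table="tmp_load_data"):
    """Return the statements for the directory case, in execution order.

    All names and paths are placeholders chosen for this sketch.
    """
    return [
        # Step 1: external table over the source directory; the schema is
        # taken from one of the data files via CREATE TABLE LIKE PARQUET.
        f"CREATE EXTERNAL TABLE {tmp_table} LIKE PARQUET '{schema_file}' "
        f"STORED AS PARQUET LOCATION '{src_dir}'",
        # Step 2: copy the rows into the Iceberg table.
        f"INSERT INTO {target_table} SELECT * FROM {tmp_table}",
        # Step 3: drop the temporary table.
        f"DROP TABLE {tmp_table}",
    ]


for stmt in load_data_statements("/tmp/staging", "/tmp/staging/part-0.parq",
                                 "iceberg_table"):
    print(stmt + ";")
```

Whether the original files survive step 3 depends on how the tmp table is created and dropped, which is the open question noted in the list.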
Users might specify the partition columns in the [PARTITION (partcol1=val1,
partcol2=val2 ...)] clause. In this case the data files don't necessarily
contain the partition values, so we need to create the tmp table with proper
partitioning.
It's possible to create child queries for a single statement, see
https://github.com/apache/impala/blob/master/be/src/service/child-query.h
Currently only COMPUTE STATS uses this. Its child queries are probably executed
in parallel, but in this task we need to execute the above statements
sequentially.
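The sequential requirement can be illustrated with a generic runner. This is a sketch only and does not use the child-query.h interface: {{run_sequentially}} and the {{execute}} callable are hypothetical names, and {{execute}} is assumed to block until one statement finishes and to raise on error.

```python
def run_sequentially(statements, execute):
    """Run statements one at a time and stop at the first failure.

    Returns (completed, error): the statements that finished, and the
    exception raised by the failing statement (None if all succeeded).
    """
    completed = []
    for stmt in statements:
        try:
            execute(stmt)  # the next statement starts only after this returns
        except Exception as err:
            return completed, err
        completed.append(stmt)
    return completed, None


def fake_execute(stmt):
    # Stand-in executor for the sketch: fails on the INSERT step.
    if stmt.startswith("INSERT"):
        raise RuntimeError("insert failed")


done, error = run_sequentially(
    ["CREATE EXTERNAL TABLE tmp_table ...",
     "INSERT INTO iceberg_table SELECT * FROM tmp_table",
     "DROP TABLE tmp_table"],
    fake_execute,
)
print(done, error)
```

In this sketch a failed INSERT leaves the tmp table behind; a real implementation would probably need to clean it up in that case.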
> Implement LOAD DATA INPATH for Iceberg tables
> ---------------------------------------------
>
> Key: IMPALA-11339
> URL: https://issues.apache.org/jira/browse/IMPALA-11339
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Assignee: LiPenglin
> Priority: Major
> Labels: impala-iceberg