[ https://issues.apache.org/jira/browse/IMPALA-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648081#comment-17648081 ]
ASF subversion and git services commented on IMPALA-11339:
----------------------------------------------------------
Commit 05a4b778d395c8813988610b78b71bcd920be037 in impala's branch
refs/heads/master from Tamas Mate
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=05a4b778d ]
IMPALA-11339: Add Iceberg LOAD DATA INPATH statement
Extend the LOAD DATA INPATH statement to support Iceberg tables. Files
written for native Parquet tables lack Iceberg field ids, so they cannot
be added to the table as they are. Instead, this change uses child
queries to load and rewrite the data. The child queries create, insert
from, and then drop a temporary table over the specified directory.
The create part depends on LIKE PARQUET/ORC clauses to infer the file
format. This requires identifying a file in the directory and using that
to create the temporary table.
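
The create, insert, drop sequence above can be sketched roughly as
follows. This is an illustration only, not the code from the commit:
the function name build_child_queries, its parameters, and the table
names are hypothetical, and the file format is passed in explicitly
rather than detected from the file.

```python
def build_child_queries(target_table, tmp_table, sample_file, data_dir, file_format):
    """Return the three statements in the order they would need to run.

    Hypothetical helper: names and parameters are assumptions made for
    illustration. Identifiers and paths are interpolated without any
    escaping, which a real implementation would have to handle.
    """
    fmt = file_format.upper()
    if fmt not in ("PARQUET", "ORC"):
        raise ValueError(f"unsupported file format: {file_format}")

    # The schema is taken from one file found in the directory.
    create = (
        f"CREATE EXTERNAL TABLE {tmp_table} LIKE {fmt} '{sample_file}' "
        f"STORED AS {fmt} LOCATION '{data_dir}'"
    )
    # Rewriting through INSERT lets the target table write its own files.
    insert = f"INSERT INTO {target_table} SELECT * FROM {tmp_table}"
    drop = f"DROP TABLE {tmp_table}"
    return [create, insert, drop]


if __name__ == "__main__":
    for stmt in build_child_queries(
        "ice_tbl", "tmp_load", "/staging/part-0.parq", "/staging", "parquet"
    ):
        print(stmt)
```

The statements have to run one after another, since each depends on the
previous one having completed.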
The target file or directory is moved to a staging directory before
ingestion, similar to native file formats. If a query fails, the files
are moved back to their original location. The child query executor
returns the error message of the failing query, and the child query
profiles are available through the WebUI.
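
The move-to-staging and move-back-on-failure behaviour can be pictured
with a small local-filesystem sketch. It is an analogy under stated
assumptions, not the commit's implementation: load_with_staging and
run_child_queries are hypothetical names, and the sketch uses local
paths rather than HDFS.

```python
import shutil
from pathlib import Path


def load_with_staging(source, staging_dir, run_child_queries):
    """Move `source` into `staging_dir`, run the load, roll back on failure.

    Hypothetical helper for illustration. `run_child_queries` is any
    callable that takes the staged path and raises if a query fails.
    """
    source = Path(source)
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)

    staged = staging_dir / source.name
    shutil.move(str(source), str(staged))
    try:
        run_child_queries(staged)
    except Exception:
        # Restore the original location, then surface the failing
        # query's error to the caller unchanged.
        shutil.move(str(staged), str(source))
        raise
    return staged
```

Re-raising the original exception mirrors the idea that the caller sees
the error message of the failing child query.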
At this point the PARTITION clause is not supported because it would
require analysis of the PartitionSpec (IMPALA-11750).
Testing:
- Added e2e tests
- Added fe unit tests
Change-Id: I8499945fa57ea0499f65b455976141dcd6d789eb
Reviewed-on: http://gerrit.cloudera.org:8080/19145
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Implement LOAD DATA INPATH for Iceberg tables
> ---------------------------------------------
>
> Key: IMPALA-11339
> URL: https://issues.apache.org/jira/browse/IMPALA-11339
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Assignee: Tamas Mate
> Priority: Major
> Labels: impala-iceberg
>
> Currently Impala doesn't support LOAD DATA statements for Iceberg tables.
> Some user workflows still use this statement, so it would be nice to
> implement it in some way.
> The parameter to LOAD DATA can be a directory or a single file.
> A possible solution would be to
> # Create an external table
> ## If the parameter is a single file, then we can use IMPALA-10934 to define
> an external table on this single file
> ## If the parameter is a directory, then we need to create an external table
> using the directory as table location. To get the table schema we could use
> CREATE TABLE LIKE PARQUET/ORC
> # run an {{insert into iceberg_table select * from tmp_table}}
> # drop the tmp table (not sure if we want to keep or remove the original
> files)
> It does some copying, but probably this would be the safest solution.
> Users might specify the partition columns in the [PARTITION (partcol1=val1,
> partcol2=val2 ...)] clause. In this case the data files don't necessarily
> contain the partition values, i.e. we need to create the tmp table with
> proper partitioning.
> It's possible to create child queries for a single statement, see
> https://github.com/apache/impala/blob/master/be/src/service/child-query.h
> Currently only COMPUTE STATS uses this. They are probably executed in
> parallel, but in this task we need to execute the above statements
> sequentially.