Zoltán Borók-Nagy created IMPALA-11752:
------------------------------------------

             Summary: Handle s3:// paths in Iceberg tables
                 Key: IMPALA-11752
                 URL: https://issues.apache.org/jira/browse/IMPALA-11752
             Project: IMPALA
          Issue Type: Bug
          Components: Backend, Frontend
            Reporter: Zoltán Borók-Nagy


Components using 
[S3FileIO|https://iceberg.apache.org/docs/latest/aws/#s3-fileio] might write 
out file paths starting with 's3://' instead of 's3a://'. The latter is used by 
[HadoopFileIO|https://iceberg.apache.org/docs/latest/aws/#hadoop-s3a-filesystem],
which is what Impala uses.

By default, HadoopFileIO doesn't interpret paths starting with 's3://'. 
(This could probably be resolved by setting "fs.s3.impl" to 
"org.apache.hadoop.fs.s3a.S3AFileSystem", so that an s3a fs instance is 
created for such paths.)

[FeIcebergTable.Utils.FeIcebergTable()|https://github.com/apache/impala/blob/2733d039ad4a830a1ea34c1a75d2b666788e39a9/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java#L671-L689]
 relies on the file paths returned by the recursive file listing matching the 
file paths in the Iceberg metadata files. But the recursive listing returns 
s3a:// paths while the metadata contains s3:// paths, so the lookups in the 
hash map 'hdfsFileDescMap' miss and we end up loading the files one-by-one.
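
To illustrate the mismatch, here is a minimal sketch of one possible 
mitigation: rewriting the 's3' scheme to 's3a' before looking a path up in a 
map keyed by listed paths. The class S3SchemeNormalizer, the method 
normalize() and the sample bucket paths are hypothetical and are not part of 
Impala; the map below only stands in for 'hdfsFileDescMap'.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not Impala code.
final class S3SchemeNormalizer {
  private static final String S3_PREFIX = "s3://";
  private static final String S3A_PREFIX = "s3a://";

  private S3SchemeNormalizer() {}

  /** Rewrites a leading "s3://" to "s3a://"; other paths are returned as-is. */
  static String normalize(String path) {
    if (path.startsWith(S3_PREFIX)) {
      return S3A_PREFIX + path.substring(S3_PREFIX.length());
    }
    return path;
  }

  public static void main(String[] args) {
    // Stand-in for a map keyed by the paths a recursive listing returned.
    Map<String, Long> listedFiles = new HashMap<>();
    listedFiles.put("s3a://bucket/db/tbl/data/f1.parquet", 1024L);

    // Path as it might appear in Iceberg metadata written via S3FileIO.
    String metadataPath = "s3://bucket/db/tbl/data/f1.parquet";

    System.out.println(listedFiles.containsKey(metadataPath));            // false
    System.out.println(listedFiles.containsKey(normalize(metadataPath))); // true
  }
}
```

Whether the normalization belongs on the metadata side, the listing side, or 
both is left open here; the sketch only shows why the exact-string lookup 
misses.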

Moreover, position delete file processing is also based on exact matches of 
the file URIs. Therefore, delete entries with s3:// paths won't have the 
desired effect.
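
The effect on position deletes can be sketched as follows. This is a 
simplified model under stated assumptions, not Impala's implementation: the 
class PositionDeleteLookup and its methods are hypothetical, and delete 
entries are modelled as (file path, row position) pairs matched by exact 
string equality unless scheme normalization is switched on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of position delete matching; not Impala code.
final class PositionDeleteLookup {
  // Data file path -> row positions deleted from that file.
  private final Map<String, List<Long>> deletesByFile = new HashMap<>();
  private final boolean normalizeScheme;

  PositionDeleteLookup(boolean normalizeScheme) {
    this.normalizeScheme = normalizeScheme;
  }

  /** Records one delete entry as read from a position delete file. */
  void addDelete(String filePath, long position) {
    deletesByFile.computeIfAbsent(key(filePath), k -> new ArrayList<>()).add(position);
  }

  /** Returns the deleted positions for a data file, or an empty list on a miss. */
  List<Long> deletedPositions(String dataFilePath) {
    return deletesByFile.getOrDefault(key(dataFilePath), Collections.emptyList());
  }

  // Optionally maps "s3://" to "s3a://" so both sides use the same key.
  private String key(String path) {
    if (normalizeScheme && path.startsWith("s3://")) {
      return "s3a://" + path.substring("s3://".length());
    }
    return path;
  }

  public static void main(String[] args) {
    String deleteEntryPath = "s3://bucket/db/tbl/data/f1.parquet";
    String scannedPath = "s3a://bucket/db/tbl/data/f1.parquet";

    PositionDeleteLookup exact = new PositionDeleteLookup(false);
    exact.addDelete(deleteEntryPath, 3L);
    // Exact matching misses, so row 3 would not be filtered out.
    System.out.println(exact.deletedPositions(scannedPath).isEmpty()); // true

    PositionDeleteLookup normalized = new PositionDeleteLookup(true);
    normalized.addDelete(deleteEntryPath, 3L);
    System.out.println(normalized.deletedPositions(scannedPath).isEmpty()); // false
  }
}
```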



