[ 
https://issues.apache.org/jira/browse/IMPALA-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063552#comment-18063552
 ] 

ASF subversion and git services commented on IMPALA-14792:
----------------------------------------------------------

Commit caeacdf331136b25669e08a7b1cc8ce9e4c1122d in impala's branch 
refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=caeacdf33 ]

IMPALA-14792: Try avoiding hadoop.fs.Path when loading Iceberg tables

Quick and dirty solution to speed up IcebergFileMetadataLoader.
Its correctness is based on the assumption that Iceberg file
locations must be normalized.

Flamegraphs showed that the org.apache.hadoop.fs.Path constructor
is one of the main CPU consumers during Iceberg table loading,
especially during incremental reloads, when most file descriptors
are reused. hadoop.fs.Path was used to relativize locations against
the base table location and to get the "path" part of the URI. Both
can be done with simple String operations if we can assume that the
URIs are normalized.
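
For illustration only, the two String operations described above might
look like the sketch below. This is not the code from the commit: the
class name NormalizedLocations and the methods relativize and pathPart
are hypothetical. It assumes every location is already normalized (no
"..", no duplicate slashes, consistent scheme and authority) and has
the form scheme://authority/path.

```java
/**
 * Hypothetical helper, not the actual Impala implementation.
 * Assumes all locations are already normalized strings.
 */
final class NormalizedLocations {
  private NormalizedLocations() {}

  /**
   * Returns the part of fileLocation below tableLocation, or null if the
   * file is not located under the table. Only a prefix check is done, so
   * the result is correct only when both inputs are normalized.
   */
  static String relativize(String tableLocation, String fileLocation) {
    // Make sure the base ends with '/', so ".../t" does not match ".../t2/...".
    String base = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
    if (!fileLocation.startsWith(base)) return null;
    return fileLocation.substring(base.length());
  }

  /**
   * Returns the path component of a "scheme://authority/path" location
   * without building a java.net.URI. Input without "://" is returned
   * unchanged. Forms such as "file:/tmp/x" are not handled by this sketch.
   */
  static String pathPart(String location) {
    int schemeEnd = location.indexOf("://");
    if (schemeEnd < 0) return location;
    // The path starts at the first '/' after the authority.
    int pathStart = location.indexOf('/', schemeEnd + 3);
    return pathStart < 0 ? "" : location.substring(pathStart);
  }

  public static void main(String[] args) {
    String table = "hdfs://nn:8020/warehouse/t";
    String file = "hdfs://nn:8020/warehouse/t/data/f.parquet";
    System.out.println(relativize(table, file));
    System.out.println(pathPart(file));
  }
}
```

Neither method parses a URI, which is the step that shows up as
expensive in the jstack output in the Jira. The trade-off is that a
non-normalized location would silently fail the prefix check.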

Results on 1M file 25K partition Iceberg table:
Full load:                  13s->10s
Incremental load (0 files): 9s->3.5s

hadoop.fs.Path constructor still uses significant CPU time after
the change, but mainly in functions that run in parallel, so
its effect is no longer that visible in total execution time.

See Jira for before/after flamegraphs.

Change-Id: Idce89117195e0fa64fdd6a6c576bce09ec2e75ea
Reviewed-on: http://gerrit.cloudera.org:8080/24052
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Csaba Ringhofer <[email protected]>


> Incremental updates of Iceberg tables is slow even with 0 new files
> -------------------------------------------------------------------
>
>                 Key: IMPALA-14792
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14792
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Major
>              Labels: iceberg
>         Attachments: incremental_after.html, incremental_before.html
>
>
> Noticed with a very big Iceberg table (25K partitions, ~1M files) that 
> incremental refresh is not much faster than full load.
> Full reload: 13s
> {code}
> 0227 11:35:53.533893 2966519 IcebergFileMetadataLoader.java:311] 194b9270d0ad58d9:d277d65700000000] Collected 956989 Iceberg content files into 25000 partitions. Duration: 2s747ms
> I20260227 11:35:53.533991 2966519 ParallelFileMetadataLoader.java:230] 194b9270d0ad58d9:d277d65700000000] Parallel Iceberg file metadata listing using a thread pool of size 5
> I20260227 11:36:03.145077 2966519 IcebergTable.java:548] 194b9270d0ad58d9:d277d65700000000] Loaded file and block metadata for default.bigice. Time taken: 13s166ms
> {code}
> Reload after ALTER TABLE SET TBLPROPERTIES: 9s
> {code}
> I20260227 11:25:19.225029 2964808 HdfsTable.java:1311] 50478fc5b1d7a62f:4609a73e00000000] Incrementally loaded table metadata for: default.bigice
> I20260227 11:25:22.142279 2964808 IcebergFileMetadataLoader.java:311] 50478fc5b1d7a62f:4609a73e00000000] Collected 0 Iceberg content files into 0 partitions. Duration: 21.044us
> I20260227 11:25:26.278229 2964808 IcebergTable.java:548] 50478fc5b1d7a62f:4609a73e00000000] Loaded file and block metadata for default.bigice. Time taken: 8s835ms
> {code}
> Based on a few random jstack samples, most of the time is spent dealing with paths:
> {code}
>    java.lang.Thread.State: RUNNABLE
>       at java.net.URI$Parser.scan([email protected]/URI.java:3082)
>       at java.net.URI$Parser.parseAuthority([email protected]/URI.java:3261)
>       at java.net.URI$Parser.parseHierarchical([email protected]/URI.java:3221)
>       at java.net.URI$Parser.parse([email protected]/URI.java:3177)
>       at java.net.URI.<init>([email protected]/URI.java:781)
>       at org.apache.hadoop.fs.Path.initialize(Path.java:259)
>       at org.apache.hadoop.fs.Path.<init>(Path.java:220)
>       at org.apache.impala.catalog.IcebergFileMetadataLoader.getOldFd(IcebergFileMetadataLoader.java:359)
>       at org.apache.impala.catalog.IcebergFileMetadataLoader.loadContentFilesWithOldFds(IcebergFileMetadataLoader.java:188)
>       at org.apache.impala.catalog.IcebergFileMetadataLoader.loadInternal(IcebergFileMetadataLoader.java:130)
>       at org.apache.impala.catalog.IcebergFileMetadataLoader.load(IcebergFileMetadataLoader.java:98)
>       at org.apache.impala.catalog.IcebergTable.loadFileMetadata(IcebergTable.java:534)
>       at org.apache.impala.catalog.IcebergTable.load(IcebergTable.java:467)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
