[ 
https://issues.apache.org/jira/browse/IMPALA-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063332#comment-18063332
 ] 

ASF subversion and git services commented on IMPALA-13122:
----------------------------------------------------------

Commit 4a675b72949c859967cc9389bef55402d16c3efa in impala's branch 
refs/heads/master from Arnab Karmakar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=4a675b729 ]

IMPALA-13122 addendum: Fix host statistics logging for erasure coded files

When erasure coding is enabled, disk IDs are unavailable for
EC blocks. The previous implementation only tracked hosts via host:disk
pairs, requiring valid disk IDs. This caused host statistics to be missing
from logs in EC environments.

Fixed by tracking host indices separately from host:disk pairs:
- Added uniqueHostIndices set to FileMetadataStats
- Track all host indices regardless of disk ID availability
- Host:disk pairs still tracked only when disk IDs are valid (>= 0)
- Updated getNumUniqueHosts() to use uniqueHostIndices directly
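
The tracking described above can be sketched roughly as follows. This is a
minimal illustration, not the actual Impala code: only uniqueHostIndices and
getNumUniqueHosts() are names taken from this commit message, while the class
name, accumulateReplica(), getNumUniqueHostDiskPairs(), and the long encoding
of host:disk pairs are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; not the actual Impala FileMetadataStats class. */
class FileMetadataStatsSketch {
  // Every host index seen, whether or not a disk ID was reported.
  private final Set<Integer> uniqueHostIndices = new HashSet<>();
  // host:disk pairs, recorded only for valid disk IDs.
  // The packed-long encoding is a hypothetical choice for this sketch.
  private final Set<Long> uniqueHostDiskPairs = new HashSet<>();

  /** Records one block replica; diskId < 0 means "unknown", as with EC blocks. */
  void accumulateReplica(int hostIndex, int diskId) {
    // Hosts are always tracked, so EC blocks still contribute to host stats.
    uniqueHostIndices.add(hostIndex);
    if (diskId >= 0) {
      // Pack host index into the high 32 bits and disk ID into the low 32 bits.
      uniqueHostDiskPairs.add(((long) hostIndex << 32) | diskId);
    }
  }

  int getNumUniqueHosts() {
    return uniqueHostIndices.size();
  }

  int getNumUniqueHostDiskPairs() {
    return uniqueHostDiskPairs.size();
  }
}
```

With this shape, replicas that report no disk ID still count toward the host
total, and the host:disk pair count stays at 0 when no valid disk ID is seen.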

With this fix:
- Traditional replication: Both hosts and host:disk pairs are logged
- Erasure coding: Hosts are logged, host:disk pairs may be 0 or omitted

Testing:
- All tests pass with and without erasure coding

Change-Id: Ie6f5b70fa9c46dd3f34287f030553360da6b20c6
Reviewed-on: http://gerrit.cloudera.org:8080/24068
Reviewed-by: Michael Smith
Tested-by: Impala Public Jenkins


> Show file stats in table loading logs
> -------------------------------------
>
>                 Key: IMPALA-13122
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13122
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Arnab Karmakar
>            Priority: Major
>              Labels: ramp-up
>             Fix For: Impala 5.0.0
>
>
> Here is an example for table loading logs on a table:
> {noformat}
> I0603 08:46:05.555567 24417 HdfsTable.java:1255] Loading metadata for table 
> definition and all partition(s) of tpcds.store_sales (needed by coordinator)
> I0603 08:46:05.642702 24417 HdfsTable.java:1896] Loaded 23 columns from HMS. 
> Actual columns: 23
> I0603 08:46:05.767457 24417 HdfsTable.java:3114] Load Valid Write Id List 
> Done. Time taken: 26.699us
> I0603 08:46:05.767549 24417 HdfsTable.java:1297] Fetching partition metadata 
> from the Metastore: tpcds.store_sales
> I0603 08:46:05.806337 24417 MetaStoreUtil.java:190] Fetching 1824 partitions 
> for: tpcds.store_sales using partition batch size: 1000 
> I0603 08:46:07.336064 24417 MetaStoreUtil.java:208] Fetched 1000/1824 
> partitions for table tpcds.store_sales
> I0603 08:46:07.915474 24417 MetaStoreUtil.java:208] Fetched 1824/1824 
> partitions for table tpcds.store_sales
> I0603 08:46:07.915519 24417 HdfsTable.java:1304] Fetched partition metadata 
> from the Metastore: tpcds.store_sales
> I0603 08:46:08.840034 24417 ParallelFileMetadataLoader.java:224] Loading file 
> and block metadata for 1824 paths for table tpcds.store_sales using a thread 
> pool of size 5
> I0603 08:46:09.383904 24417 HdfsTable.java:836] Loaded file and block 
> metadata for tpcds.store_sales partitions: ss_sold_date_sk=2450816, 
> ss_sold_date_sk=2450817, ss_sold_date_sk=2450818, and 1821 others. Time 
> taken: 569.107ms
> I0603 08:46:09.420702 24417 Table.java:1117] last refreshed event id for 
> table: tpcds.store_sales set to: -1
> I0603 08:46:09.420794 24417 TableLoader.java:177] Loaded metadata for: 
> tpcds.store_sales (4026ms){noformat}
> From the logs, we know the table has 23 columns and 1824 partitions. The time 
> spent loading the table schema and file metadata is also shown.
> However, the logs do not reveal whether the partitions suffer from a 
> small-files problem. The underlying storage could also be slow (e.g. S3), 
> which results in a long time spent loading file metadata.
> It'd be helpful to add these to the logs:
>  * number of files loaded
>  * min/avg/max of file sizes
>  * total file size
>  * number of files
>  * number of blocks (HDFS only)
>  * number of hosts, disks (HDFS/Ozone only)
>  * Stats of accessTime and lastModifiedTime
> These can be aggregated in FileMetadataLoader#loadInternal() and logged in 
> ParallelFileMetadataLoader#load() or 
> HdfsTable#loadFileMetadataForPartitions().
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L177]
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L172]
> [https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L836]
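
As a rough illustration of the file-size aggregation requested above, the
count, total, and min/avg/max could be accumulated per file and formatted once
at the end. This is a sketch under stated assumptions, not Impala's
implementation: the class name, method names, and log format are all
hypothetical, and only java.util.LongSummaryStatistics is a real JDK class.

```java
import java.util.Locale;
import java.util.LongSummaryStatistics;

/** Hypothetical aggregator for file sizes; names and format are illustrative. */
class FileSizeStatsSketch {
  private final LongSummaryStatistics sizes = new LongSummaryStatistics();

  /** Called once per loaded file with its length in bytes. */
  void accumulateFile(long fileSizeBytes) {
    sizes.accept(fileSizeBytes);
  }

  /** Builds a one-line summary suitable for appending to a log message. */
  String toLogString() {
    if (sizes.getCount() == 0) {
      // Avoid reporting meaningless min/max values for an empty set.
      return "files=0";
    }
    // Integer division keeps the average in whole bytes.
    long avg = sizes.getSum() / sizes.getCount();
    return String.format(Locale.ROOT,
        "files=%d, total=%dB, min=%dB, avg=%dB, max=%dB",
        sizes.getCount(), sizes.getSum(), sizes.getMin(), avg, sizes.getMax());
  }
}
```

Many files with a small average size in such a summary would point to a
small-files problem, which the existing log lines cannot show.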


