[
https://issues.apache.org/jira/browse/IMPALA-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062790#comment-18062790
]
Michael Smith edited comment on IMPALA-13122 at 3/4/26 6:25 PM:
----------------------------------------------------------------
[~arnabk1108] I see
custom_cluster.test_file_metadata_stats.TestFileMetadataStats.test_file_metadata_stats_host_disk_pairs
failing when run with erasure coding enabled. Please take a look. To enable
erasure coding, run
{code:java}
export ERASURE_CODING=true
./buildall.sh -notests -format -start_minicluster -start_impala_cluster
create_testdata.sh
load-data.py --workloads functional-query --table_format text/none --table_name alltypessmall{code}
to rebuild HDFS with erasure coding and load the necessary test data.
h3. Error Message
{code:java}
AssertionError: Expected at least one line in file /data0/jenkins/workspace/impala-asf-master-core-erasure-coding/repos/Impala/logs/custom_cluster_tests/TestFileMetadataStats/catalogd.impala-ec2-redhat86-m6i-4xlarge-ondemand-17eb.vpc.cloudera.com.jenkins.log.INFO.20260303-213812.1092086 matching regex 'Hosts: \d+', but found none. {code}
h3. Stacktrace
{code:java}
custom_cluster/test_file_metadata_stats.py:130: in test_file_metadata_stats_host_disk_pairs
self.assert_catalogd_log_contains("INFO", hosts_regex, expected_count=-1,
hosts_regex = 'Hosts: \\d+'
self = <tests.custom_cluster.test_file_metadata_stats.TestFileMetadataStats object at 0x7f3c14167710>
tbl_name = 'functional.alltypessmall'
common/impala_test_suite.py:1724: in assert_catalogd_log_contains
return self.assert_log_contains(
daemon = 'catalogd'
dry_run = False
expected_count = -1
level = 'INFO'
line_regex = 'Hosts: \\d+'
node_index = 0
self = <tests.custom_cluster.test_file_metadata_stats.TestFileMetadataStats object at 0x7f3c14167710>
timeout_s = 15
common/impala_test_suite.py:1802: in assert_log_contains
assert found > 0, "Expected at least one line in file %s matching regex '%s'"\
E AssertionError: Expected at least one line in file /data0/jenkins/workspace/impala-asf-master-core-erasure-coding/repos/Impala/logs/custom_cluster_tests/TestFileMetadataStats/catalogd.impala-ec2-redhat86-m6i-4xlarge-ondemand-17eb.vpc.cloudera.com.jenkins.log.INFO.20260303-213812.1092086 matching regex 'Hosts: \d+', but found none.
daemon = 'catalogd'
dry_run = False
expected_count = -1
found = 0
last_re_result = None
level = 'INFO'
line = 'I20260303 21:38:19.042086 1092539 catalog-server.cc:790] A catalog update with 6 entries is assembled. Catalog version: 2140 Last sent catalog version: 2139\n'
line_regex = 'Hosts: \\d+'
log_file = <_io.BufferedReader name='/data0/jenkins/workspace/impala-asf-master-core-erasure-coding/repos/Impala/logs/custom_clus...tats/catalogd.impala-ec2-redhat86-m6i-4xlarge-ondemand-17eb.vpc.cloudera.com.jenkins.log.INFO.20260303-213812.1092086'>
log_file_path = '/data0/jenkins/workspace/impala-asf-master-core-erasure-coding/repos/Impala/logs/custom_cluster_tests/TestFileMetadataStats/catalogd.impala-ec2-redhat86-m6i-4xlarge-ondemand-17eb.vpc.cloudera.com.jenkins.log.INFO.20260303-213812.1092086'
pattern = re.compile('Hosts: \\d+')
re_result = None
self = <tests.custom_cluster.test_file_metadata_stats.TestFileMetadataStats object at 0x7f3c14167710>
start_time = 1772602699.0046756
timeout_s = 15 {code}
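For readers unfamiliar with the check that fails here, the following is a minimal sketch of the kind of assertion the stack trace shows: scan log lines for a regex and, when expected_count is -1, require at least one match. It is not the code in common/impala_test_suite.py; the function names are hypothetical, and the "Hosts: 3" sample line is made up for illustration, since the real format of the file-stats log line is not shown in this message.

```python
import re


def count_matching_lines(lines, line_regex):
    """Return how many of the given lines contain a match for line_regex."""
    pattern = re.compile(line_regex)
    return sum(1 for line in lines if pattern.search(line))


def assert_lines_contain(lines, line_regex, expected_count=-1):
    """Hypothetical, simplified stand-in for the log assertion in the trace.

    expected_count == -1 means "at least one matching line", mirroring the
    `assert found > 0` seen in the stack trace. Any other value requires an
    exact match count (an assumption made for this sketch).
    """
    found = count_matching_lines(lines, line_regex)
    if expected_count == -1:
        assert found > 0, (
            "Expected at least one line matching regex '%s', but found none."
            % line_regex)
    else:
        assert found == expected_count, (
            "Expected %d lines matching regex '%s', but found %d."
            % (expected_count, line_regex, found))
    return found


if __name__ == "__main__":
    # Last line read in the failing run (taken from the stack trace above).
    real_line = ("I20260303 21:38:19.042086 1092539 catalog-server.cc:790] "
                 "A catalog update with 6 entries is assembled. "
                 "Catalog version: 2140 Last sent catalog version: 2139\n")
    # Made-up line that would satisfy the regex; the real format may differ.
    hypothetical_line = "file stats ... Hosts: 3 ...\n"

    print(count_matching_lines([real_line], r"Hosts: \d+"))
    print(count_matching_lines([real_line, hypothetical_line], r"Hosts: \d+"))
```

With erasure coding enabled, the catalogd log apparently contains no line matching 'Hosts: \d+' within the timeout, so the "at least one match" branch is the one that fails.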
> Show file stats in table loading logs
> -------------------------------------
>
> Key: IMPALA-13122
> URL: https://issues.apache.org/jira/browse/IMPALA-13122
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Arnab Karmakar
> Priority: Major
> Labels: ramp-up
> Fix For: Impala 5.0.0
>
>
> Here is an example for table loading logs on a table:
> {noformat}
> I0603 08:46:05.555567 24417 HdfsTable.java:1255] Loading metadata for table definition and all partition(s) of tpcds.store_sales (needed by coordinator)
> I0603 08:46:05.642702 24417 HdfsTable.java:1896] Loaded 23 columns from HMS. Actual columns: 23
> I0603 08:46:05.767457 24417 HdfsTable.java:3114] Load Valid Write Id List Done. Time taken: 26.699us
> I0603 08:46:05.767549 24417 HdfsTable.java:1297] Fetching partition metadata from the Metastore: tpcds.store_sales
> I0603 08:46:05.806337 24417 MetaStoreUtil.java:190] Fetching 1824 partitions for: tpcds.store_sales using partition batch size: 1000
> I0603 08:46:07.336064 24417 MetaStoreUtil.java:208] Fetched 1000/1824 partitions for table tpcds.store_sales
> I0603 08:46:07.915474 24417 MetaStoreUtil.java:208] Fetched 1824/1824 partitions for table tpcds.store_sales
> I0603 08:46:07.915519 24417 HdfsTable.java:1304] Fetched partition metadata from the Metastore: tpcds.store_sales
> I0603 08:46:08.840034 24417 ParallelFileMetadataLoader.java:224] Loading file and block metadata for 1824 paths for table tpcds.store_sales using a thread pool of size 5
> I0603 08:46:09.383904 24417 HdfsTable.java:836] Loaded file and block metadata for tpcds.store_sales partitions: ss_sold_date_sk=2450816, ss_sold_date_sk=2450817, ss_sold_date_sk=2450818, and 1821 others. Time taken: 569.107ms
> I0603 08:46:09.420702 24417 Table.java:1117] last refreshed event id for table: tpcds.store_sales set to: -1
> I0603 08:46:09.420794 24417 TableLoader.java:177] Loaded metadata for: tpcds.store_sales (4026ms){noformat}
> From the logs, we know the table has 23 columns and 1824 partitions. The time
> spent loading the table schema and file metadata is also shown.
> However, it's unknown whether the partitions suffer from a small-files issue.
> The underlying storage could also be slow (e.g. S3), which results in a long
> file metadata loading time.
> It'd be helpful to add these to the logs:
> * number of files loaded
> * min/avg/max of file sizes
> * total file size
> * number of files
> * number of blocks (HDFS only)
> * number of hosts, disks (HDFS/Ozone only)
> * Stats of accessTime and lastModifiedTime
> These can be aggregated in FileMetadataLoader#loadInternal() and logged in
> ParallelFileMetadataLoader#load() or
> HdfsTable#loadFileMetadataForPartitions().
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L177]
> [https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L172]
> [https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L836]
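As a rough illustration of the aggregation requested above (file count, min/avg/max size, total size), here is a minimal sketch. It is written in Python for brevity, whereas the actual change would live in the Java frontend classes linked above. FileSizeStats, stats_for, and all field names are hypothetical and do not come from the Impala code base. The merge step assumes that per-path stats are collected independently (for example, one per loader in the thread pool) and combined afterwards before logging.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FileSizeStats:
    """Hypothetical accumulator for file-size statistics."""
    num_files: int = 0
    total_bytes: int = 0
    min_bytes: Optional[int] = None
    max_bytes: Optional[int] = None

    def add(self, size_bytes: int) -> None:
        """Fold one file's size into the running statistics."""
        self.num_files += 1
        self.total_bytes += size_bytes
        if self.min_bytes is None or size_bytes < self.min_bytes:
            self.min_bytes = size_bytes
        if self.max_bytes is None or size_bytes > self.max_bytes:
            self.max_bytes = size_bytes

    def merge(self, other: "FileSizeStats") -> None:
        """Combine stats gathered elsewhere (e.g. for another partition)."""
        if other.num_files == 0:
            return  # Nothing to merge; keeps min/max untouched.
        self.num_files += other.num_files
        self.total_bytes += other.total_bytes
        if self.min_bytes is None or other.min_bytes < self.min_bytes:
            self.min_bytes = other.min_bytes
        if self.max_bytes is None or other.max_bytes > self.max_bytes:
            self.max_bytes = other.max_bytes

    @property
    def avg_bytes(self) -> float:
        """Average file size, or 0.0 when no files were seen."""
        return self.total_bytes / self.num_files if self.num_files else 0.0


def stats_for(sizes: Iterable[int]) -> FileSizeStats:
    """Build stats for one collection of file sizes."""
    stats = FileSizeStats()
    for size in sizes:
        stats.add(size)
    return stats


if __name__ == "__main__":
    table_stats = FileSizeStats()
    # Two made-up partitions with made-up file sizes in bytes.
    for partition_sizes in ([10, 20, 60], [5]):
        table_stats.merge(stats_for(partition_sizes))
    print(table_stats, table_stats.avg_bytes)
```

Keeping only counters and running min/max means each loader needs constant memory, and merging is cheap enough to do once per table load.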