[jira] [Commented] (IMPALA-14637) test_event_based_replication is flaky for truncate table

Quanlong Huang (Jira) Fri, 19 Dec 2025 04:44:05 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-14637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18046591#comment-18046591
 ]


Quanlong Huang commented on IMPALA-14637:
-----------------------------------------

The test truncates the table in source_db in Impala. Then run REPL DUMP + REPL 
LOAD in Hive to truncate the table in target_db. Finally verifying the table is 
empty in Impala failed.

The REPL LOAD command finishes at 03:01:39,567 and the failed Impala query 
starts at 03:01:40,407:
{noformat}
repl load repl_source_ozsye into repl_target_wkunk

-- 2025-12-18 03:01:39,567 INFO     MainThread: hs2: executing against Impala 
at localhost:21050. session: ea4975a16d9cc691:da8734908144f3ae main_cursor: 
True user: None
select count(*) from repl_target_wkunk.unpart_tbl; // This passes. Result is 0.
...
select count(*) from repl_target_wkunk.part_tbl;  // This fails. Result is not 
0.

-- 2025-12-18 03:01:40,407 INFO     MainThread: 
504c3091868ad1d4:0c00f08300000000: query started{noformat}
In catalogd logs, query waits enough time until the events are processed:
{noformat}
I20251218 03:01:40.339617   693 CatalogServiceCatalog.java:4713] 
504c3091868ad1d4:0c00f08300000000] waitForHmsEvent request: 
want_minimal_response=true, coordinator=192.168.124.4, timeout_s=10, 
want_db_list=false, want_table_list=false, 
objects=[TABLE:repl_target_wkunk.part_tbl, TABLE:default.repl_target_wkunk, 
DATABASE:default, DATABASE:repl_target_wkunk], should_expand_views=true
...
I20251218 03:01:40.349382   693 MetastoreEventsProcessor.java:1944] 
504c3091868ad1d4:0c00f08300000000] Last synced event id (47169) reached 47169
I20251218 03:01:40.349473   693 CatalogServiceCatalog.java:4790] 
504c3091868ad1d4:0c00f08300000000] waitForHmsEvent succeeds. 
updated_objects=[TABLE:repl_target_wkunk.part_tbl(55900), 
DATABASE:default(42206), DATABASE:repl_target_wkunk(55905)], 
removed_objects=[]{noformat}
The last time that catalogd reloads file metadata of the table is 
03:01:38.367844
{noformat}
I20251218 03:01:38.353439   160 HdfsTable.java:1256] Reloading metadata for 
table definition and all partition(s) of repl_target_wkunk.part_tbl 
(ALTER_BATCH_PARTITIONS 2 events from 47101 to 47102)
I20251218 03:01:38.354635   160 HdfsTable.java:1923] Loaded 13 columns from 
HMS. Actual columns: 13
I20251218 03:01:38.361160   160 HdfsTable.java:2970] Valid writeIds of table 
repl_target_wkunk.part_tbl: repl_target_wkunk.part_tbl:4:4:4:
I20251218 03:01:38.361230   160 HdfsTable.java:3003] Load Valid Write Id List 
Done. Time taken: 1.785ms
I20251218 03:01:38.363641   160 ParallelFileMetadataLoader.java:230] Refreshing 
file and block metadata for 4 paths for table repl_target_wkunk.part_tbl using 
a thread pool of size 4
I20251218 03:01:38.367844   160 HdfsTable.java:781] Loaded file and block 
metadata for repl_target_wkunk.part_tbl partitions: year=2009/month=4, 
year=2009/month=3, year=2009/month=1, and 1 others. Time taken: 
5.679ms{noformat}
The valid writeIds shows that writeId 4 is still open: 
repl_target_wkunk.part_tbl: repl_target_wkunk.part_tbl:4:4:4:
So it's expected that when processing the COMMIT_TXN event, catalogd should 
reload file metadata modified by this writeId. However, catalogd logs show that 
this didn't happen:
{noformat}
I20251218 03:01:39.249342   160 CatalogOpExecutor.java:5549] Not reloading 
partitions of table repl_target_wkunk.part_tbl for event 47109 since the cache 
is already up-to-date
I20251218 03:01:39.249399   160 CatalogServiceCatalog.java:4808] Trying to add 
COMMITTED write ids [4] to table repl_target_wkunk.part_tbl for event 47109
I20251218 03:01:39.249475   160 HdfsTable.java:3095] Setting the catalog 
version of HdfsTable@2df58760 repl_target_wkunk.part_tbl to 55900
I20251218 03:01:39.249501   160 CatalogServiceCatalog.java:4883] Added 
COMMITTED writeId to table repl_target_wkunk.part_tbl: [4] for event 47109
I20251218 03:01:39.249531   160 MetastoreEvents.java:936] EventId: 47109 
EventType: COMMIT_TXN Target: repl_target_wkunk.part_tbl. Scheduling delay: 
31ms, Process time:
I20251218 03:01:39.249559   160 MetastoreEvents.java:947] EventId: 47109 
EventType: COMMIT_TXN Target: CLUSTER_WIDE. Complete process time: 
31ms{noformat}
In a passed run of the test, the valid writeIds loaded in the last file 
metadata loading is "<table_name>:4:9223372036854775807::". So it won't be 
impacted by this COMMIT_TXN event.

Tracking the code, it seems HMS returns an empty WriteEventInfo list for this 
transaction here.
[https://github.com/apache/impala/blob/1684c2d9da20cd27dcfc30dbaf940ac8144b496f/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java#L1193-L1194]

CC [~VenuReddy], [~hemanth619] for more thoughts.

> test_event_based_replication is flaky for truncate table
> --------------------------------------------------------
>
>                 Key: IMPALA-14637
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14637
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> Saw this test failed again like IMPALA-12187: 
> https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/4377
> {code:python}
> metadata/test_event_processing.py:122: in test_event_based_replication
>     self._run_event_based_replication_tests_impl(self,
> metadata/test_event_processing_base.py:305: in 
> _run_event_based_replication_tests_impl
>     assert rows_in_part_tbl_target == 0
> E   assert 100 == 0{code}
> But the implementation is much different now that we have both the 
> hms_event_sync feature and hierarchical event processing enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-14637) test_event_based_replication is flaky for truncate table

Reply via email to