HeLifu created KUDU-2942:
----------------------------

             Summary: A rare flaky test for the aggregated live row count
                 Key: KUDU-2942
                 URL: https://issues.apache.org/jira/browse/KUDU-2942
             Project: Kudu
          Issue Type: Bug
            Reporter: HeLifu
         Attachments: ts_tablet_manager-itest.txt

A few days ago, Adar met a rare flaky test for the live row count in TSAN mode.

 
{code:java}
// code placeholder
/home/jenkins-slave/workspace/kudu-master/3/src/kudu/integration-tests/ts_tablet_manager-itest.cc:642
      Expected: live_row_count
      Which is: 327
To be equal to: table_info->GetMetrics()->live_row_count->value()
      Which is: 654
{code}
It seems the metric value is doubled. And his full test output is in the 
attachment.

 

I reviewed the previous patches and made some unusual guesses. I think one of 
them could explain the issue:

When one master just becomes the leader and there are two heartbeat messages 
from the same tserver that are processed in parallel at 
[Line4239|https://github.com/apache/kudu/blob/1bdae88faefe9b0d43b6897d96cd853bc5dd7353/src/kudu/master/catalog_manager.cc#L4239],
 then the metric value will be doubled because the old tablet stats can be 
accessed concurrently.

Thus, the question becomes how to generate two heartbeat messages from the same 
tserver at the same time? The possible answer is: [First heartbeat 
message|https://github.com/apache/kudu/blob/1bdae88faefe9b0d43b6897d96cd853bc5dd7353/src/kudu/integration-tests/ts_tablet_manager-itest.cc#L741]
 and [Second heartbeat 
message|https://github.com/apache/kudu/blob/1bdae88faefe9b0d43b6897d96cd853bc5dd7353/src/kudu/integration-tests/ts_tablet_manager-itest.cc#L635]

Please don't forget the above case is integrate test environment, not product.

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to