[
https://issues.apache.org/jira/browse/KUDU-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adar Dembo updated KUDU-2665:
-----------------------------
Attachment: block_manager-stress-test.txt.gz
Priority: Blocker (was: Major)
Issue Type: Bug (was: New Feature)
I think I've identified the root cause.
The new runtime container deletion code works as follows:
# When a deletion transaction goes out of scope, we check all containers that
participated. If any are both full and have no live blocks _right now_, we
declare them to be dead, mark them as such, and remove their refs from global
log block manager state.
# The dead containers continue to live in memory because they have other
referents. These referents are ongoing {{WritableBlock}} instances (there
shouldn't be any because the container is dead) and opened {{ReadableBlock}}
instances (these may exist).
# When the last referent is closed, the container's destructor runs. Because
the container was marked as dead, its on-disk files are now removed.
To make all this work, it is assumed that when a container is both full and has
no live blocks anymore, it is going to remain in that state in perpetuity.
That's logically true: a full container with no live blocks isn't going to be
used for any new blocks. However, due to the nature of {{WritableBlock}}
finalization/closing, it's possible for a container with outstanding
{{WritableBlock}} instances to briefly appear as dead. That's because:
# The container's next block offset (responsible for determining fullness) is
incremented when the {{WritableBlock}} is finalized, but
# The container's live block count is incremented when the {{WritableBlock}} is
_closed_.
Thus, if the "last" block in a container is deleted after a {{WritableBlock}}
has been finalized but before it has been closed, the container will be
erroneously marked as dead. What's the effect? When the container's last
referent disappears (i.e. the last outstanding {{ReadableBlock}} is closed), it
will be deleted from disk _despite having live blocks in it_. Because
block_manager-stress-test restarts from time to time, the block manager thus
loses blocks that the test still expects to find.
I'm attaching the test's output with a lot more instrumentation showing the bug.
We absolutely need to fix this before releasing 1.9, or at least disable the
runtime container deletion code.
> BlockManagerStressTest.StressTest is extremely flaky
> ----------------------------------------------------
>
> Key: KUDU-2665
> URL: https://issues.apache.org/jira/browse/KUDU-2665
> Project: Kudu
> Issue Type: Bug
> Components: fs
> Affects Versions: 1.9.0
> Reporter: Mike Percy
> Assignee: HeLifu
> Priority: Blocker
> Fix For: 1.9.0
>
> Attachments: block_manager-stress-test.txt.gz
>
>
> After some recent block manager changes the Block Manager Stress Test is
> about 50% flaky on certain precommit builds. The failure looks like this:
> {code:java}
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/fs/block_manager-stress-test.cc:518:
> Failure
> Failed
> Bad status: Not found:
> /data/somelongdirectorytoavoidrpathissues/src/kudutest/block_manager-stress-test.0.BlockManagerStressTest_1.StressTest.1547778831841692-23619/data/e8ab31ef3e2143a5bc6d7a2b40e7805b.data:
> No such file or directory (error 2)
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/fs/block_manager-stress-test.cc:549:
> Failure
> Expected: this->InjectNonFatalInconsistencies() doesn't generate new fatal
> failures in the current thread.
> Actual: it does.
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)