[ 
https://issues.apache.org/jira/browse/KUDU-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866820#comment-15866820
 ] 

Todd Lipcon commented on KUDU-1447:
-----------------------------------

True. The issue is that we also don't want to get into the business of 
detecting specific kernel versions, distros, etc. The warnings in the logs 
about a thread taking a long time to start are the red flag. Perhaps we can 
add a note to that particular warning that says "(if running el6, check that 
THP is disabled)"?

Given that you seem to have a cluster where this reproduces easily, maybe we 
can also try the madvise() idea above to fully work around the issue (though 
that might still require version detection).

> Document recommendation to disable THP
> --------------------------------------
>
>                 Key: KUDU-1447
>                 URL: https://issues.apache.org/jira/browse/KUDU-1447
>             Project: Kudu
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> While doing a bunch of cluster testing, I finally got to the root of why 
> threads sometimes take several seconds to start up, causing various timeout 
> issues, false elections, etc. It turns out that khugepaged does synchronous 
> page compaction while holding a process's mmap semaphore, and when that 
> coincides with lots of IO, it can block for several seconds.
> https://lkml.org/lkml/2011/7/26/103
> To avoid this, we should tell users to set transparent hugepages to 
> "madvise" or "never" -- it's not sufficient to just disable defrag, because 
> khugepaged still runs in the background in that case and causes this 
> sporadic issue.
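
For illustration, here is a minimal sketch of how the THP setting could be 
checked on a host. It is not Kudu code: the function names and the warning 
text are hypothetical, and it assumes the usual sysfs locations (the upstream 
path, plus the redhat_transparent_hugepage path used by el6 kernels). The 
kernel marks the active mode with square brackets, e.g. "always madvise 
[never]".

```python
import re
from pathlib import Path

# Assumed sysfs locations: the upstream path first, then the el6 variant.
THP_ENABLED_PATHS = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
)


def parse_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs file contents."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(paths=THP_ENABLED_PATHS):
    """Return the active THP mode, or None if no sysfs file is readable."""
    for path in paths:
        try:
            return parse_thp_mode(Path(path).read_text())
        except OSError:
            continue  # Path missing or unreadable; try the next candidate.
    return None


def thp_warning(mode):
    """Return a hypothetical warning string when THP is set to 'always'."""
    if mode == "always":
        return ("transparent hugepages are set to 'always'; "
                "consider 'madvise' or 'never'")
    return None


if __name__ == "__main__":
    mode = current_thp_mode()
    print("THP mode:", mode)
    warning = thp_warning(mode)
    if warning:
        print("WARNING:", warning)
```

The sketch only reads the "enabled" file and deliberately ignores the 
"defrag" file, since disabling defrag alone does not avoid the problem 
described above.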



