Hi Ruilong, I've been brainstorming the issue, and this is my proposed solution. Please tell me what you think.
Segments are stateless. In Greenplum, we worry about catalog corruption when a segment dies. In HAWQ, all of the data nodes are stateless: even if the OOM killer ends up killing a segment, we shouldn't need to worry about catalog corruption. *Only the master has a catalog that matters.*

My proposal: because the catalog matters on the master, we should probably continue to run master nodes with vm.overcommit_memory = 2. On the segments, however, I think we shouldn't worry so much about an OOM event. The problem still remains that all queries across the cluster will be canceled if a data node goes offline (at least until HAWQ is able to restart failed query executors).

If we *really* want to prevent the segments from being killed, we could tell the kernel to prefer killing the other processes on the node via the /proc/<pid>/oom_score_adj facility. Because Hadoop processes are generally resilient enough to restart failed containers, most Java processes can be treated as more expendable than HAWQ processes. Rough sketches of both ideas are included below the quoted mail.

/proc/<pid>/oom_score_adj ref:
https://www.kernel.org/doc/Documentation/filesystems/proc.txt

Thanks,
Taylor Vesely

On Fri, Dec 16, 2016 at 7:01 AM, Ruilong Huo <[email protected]> wrote:

> Hi HAWQ Community,
>
> The overcommit_memory setting in Linux controls the behaviour of memory
> allocation. In a cluster deployed with both HAWQ and Hadoop, it is
> controversial how to set overcommit_memory on the nodes. To be specific,
> HAWQ recommends overcommit strategy 2, while Hadoop recommends 1 or 0.
>
> This thread is to start the discussion of the options, so that we can
> make a reasonable choice that works well for both products.
>
> *1. From the HAWQ perspective*
>
> It is recommended to use vm.overcommit_memory = 2 (rather than 0 or 1)
> to prevent random kills of HAWQ processes and the backend resets they
> cause.
>
> If the nodes of the cluster are set to overcommit_memory = 0 or 1, there
> is a risk that running queries might get terminated due to a backend
> reset. Even worse, with overcommit_memory = 1, there is a chance that
> data files and transaction logs might get corrupted due to insufficient
> cleanup during process exit when OOM happens. More details of the
> overcommit_memory setting in HAWQ can be found at:
> Linux-Overcommit-strategies-and-Pivotal-GPDB-HDB
> <https://discuss.pivotal.io/hc/en-us/articles/202703383-Linux-Overcommit-strategies-and-Pivotal-Greenplum-GPDB-Pivotal-HDB-HDB->
>
> *2. From the Hadoop perspective*
>
> The crash of a datanode usually happens when there is not enough heap
> memory for the JVM. To be specific, the JVM allocates more heap (via a
> malloc or mmap system call) and the address space has been exhausted.
> With overcommit_memory = 2, when we run out of available address space
> the system returns ENOMEM for the system call, and the JVM crashes.
>
> This is due to the fact that Java is very address-space greedy: it
> allocates large regions of address space that it isn't actually using.
> The overcommit_memory = 2 setting doesn't actually restrict physical
> memory use; it restricts address space use. Many applications
> (especially Java) allocate sparse pages of memory and rely on the
> kernel/OS to actually provide the memory as soon as a page fault occurs.
>
> Best regards,
> Ruilong Huo
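
For illustration, here is a minimal Python sketch of the oom_score_adj idea. The process-name match strings ("postgres" for HAWQ processes, "java" for Hadoop containers) and the adjustment values are assumptions rather than tested recommendations, and the script has to run as root to write another process's oom_score_adj.

    #!/usr/bin/env python
    # Sketch: bias the OOM killer toward Java containers and away from HAWQ.
    # Assumptions (adjust for the real deployment):
    #   - HAWQ processes show up as "postgres" in /proc/<pid>/comm,
    #   - Hadoop/YARN containers show up as "java",
    #   - the script runs as root, so it may write other processes' settings.
    # Valid oom_score_adj values range from -1000 (never kill) to 1000
    # (kill first); 0 is the default.

    import os

    # Placeholder process names and adjustments; tune for the actual cluster.
    ADJUSTMENTS = {
        "postgres": "-500",  # make HAWQ processes less attractive to the OOM killer
        "java": "200",       # make Java containers more expendable
    }

    def set_oom_score_adj(pid, value):
        """Write the desired adjustment into /proc/<pid>/oom_score_adj."""
        with open("/proc/%s/oom_score_adj" % pid, "w") as f:
            f.write(value)

    def main():
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/comm" % pid) as f:
                    comm = f.read().strip()
            except IOError:
                continue  # the process exited while we were scanning
            for name, value in ADJUSTMENTS.items():
                if name in comm:
                    try:
                        set_oom_score_adj(pid, value)
                    except IOError:
                        pass  # insufficient privileges or the process is gone

    if __name__ == "__main__":
        main()

Note that oom_score_adj is per-process, so something (a cron job, or the daemons' startup scripts) would have to re-apply it whenever the processes restart.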

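To make the master/segment split concrete, here is a second small sketch. The HAWQ_NODE_ROLE environment variable is purely hypothetical (whatever the deployment actually uses to distinguish the master host from segment hosts would go there), and the segment-side value of 0 is simply the kernel default rather than a tested recommendation.

    #!/usr/bin/env python
    # Sketch: apply the proposed per-role overcommit policy.
    #   master  -> vm.overcommit_memory = 2  (protect the catalog)
    #   segment -> vm.overcommit_memory = 0  (kernel default; tolerate OOM kills)
    # HAWQ_NODE_ROLE is a hypothetical stand-in for however the deployment
    # actually distinguishes the master host from segment hosts.

    import os

    def set_overcommit_memory(value):
        # Equivalent to: sysctl -w vm.overcommit_memory=<value>
        with open("/proc/sys/vm/overcommit_memory", "w") as f:
            f.write(str(value))

    if __name__ == "__main__":
        role = os.environ.get("HAWQ_NODE_ROLE", "segment")
        set_overcommit_memory(2 if role == "master" else 0)

In practice this would just be the corresponding vm.overcommit_memory entry in /etc/sysctl.conf on each class of host; the script only spells out the split.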