This issue has been raised many times. I think Taylor gave a good proposal.
From a long-term perspective, I think we should add more tests around killing processes randomly. If that leads to corruption, it is a bug. From a database perspective, we should not assume that processes cannot be killed under specific conditions or at particular times.

Thanks,
Lei

On Sat, Dec 17, 2016 at 1:43 AM, Taylor Vesely <[email protected]> wrote:

> Hi Ruilong,
>
> I've been brainstorming the issue, and this is my proposed solution. Please tell me what you think.
>
> Segments are stateless. In Greenplum, we worry about catalog corruption when a segment dies. In HAWQ, all of the data nodes are stateless. Even if the OOM killer ends up killing a segment, we shouldn't need to worry about catalog corruption. *Only the master has a catalog that matters.*
>
> My proposition:
>
> Because the catalog matters on the master, we should probably continue to run master nodes with vm.overcommit_memory = 2. On the segments, however, I think we shouldn't worry so much about an OOM event. The problem still remains that all queries across the cluster will be canceled if a data node goes offline (at least until HAWQ is able to restart failed query executors).
>
> If we *really* want to prevent the segments from being killed, we could tell the kernel to prefer killing the other processes on the node via the /proc/<pid>/oom_score_adj facility. Because Hadoop processes are generally resilient enough to restart failed containers, most Java processes can be treated as more expendable than HAWQ processes.
>
> /proc/<pid>/oom_score_adj ref: https://www.kernel.org/doc/Documentation/filesystems/proc.txt
>
> Thanks,
>
> Taylor Vesely
>
> On Fri, Dec 16, 2016 at 7:01 AM, Ruilong Huo <[email protected]> wrote:
>
> > Hi HAWQ Community,
> >
> > The overcommit_memory setting in Linux controls the behaviour of memory allocation. In a cluster where HAWQ and Hadoop are deployed together, it is controversial how to set overcommit_memory on the nodes. To be specific, HAWQ recommends overcommit strategy 2, while Hadoop recommends 1 or 0.
> >
> > This thread is to start the discussion of the options, so that we can make a reasonable choice that works well with both products.
> >
> > *1. From the HAWQ perspective*
> >
> > It is recommended to use vm.overcommit_memory = 2 (rather than 0 or 1) to prevent random kills of HAWQ processes and the resulting backend resets.
> >
> > If nodes of the cluster are set to overcommit_memory = 0 or 1, there is a risk that a running query gets terminated by a backend reset. Even worse, with overcommit_memory = 1, there is a chance that data files and transaction logs get corrupted due to insufficient cleanup during process exit when OOM happens. More details of the overcommit_memory setting in HAWQ can be found at: Linux-Overcommit-strategies-and-Pivotal-GPDB-HDB <https://discuss.pivotal.io/hc/en-us/articles/202703383-Linux-Overcommit-strategies-and-Pivotal-Greenplum-GPDB-Pivotal-HDB-HDB->.
> >
> > *2. From the Hadoop perspective*
> >
> > A datanode crash usually happens when there is not enough heap memory for the JVM. To be specific, the JVM allocates more heap (via a malloc or mmap system call) and the address space has been exhausted. When overcommit_memory = 2 and we run out of available address space, the system returns ENOMEM for the system call, and the JVM crashes.
> >
> > This is because Java is very address-space greedy. It will allocate large regions of address space that it isn't actually using. The overcommit_memory = 2 setting doesn't actually restrict physical memory use; it restricts address space use. Many applications (especially Java) allocate sparse pages of memory and rely on the kernel/OS to actually provide the memory when a page fault occurs.
> >
> > Best regards,
> > Ruilong Huo
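To make the point about address space versus physical memory concrete, here is a minimal Python sketch (not from the original thread; the 64 GiB size is an arbitrary illustration). It reserves a large anonymous mapping without ever touching the pages: under vm.overcommit_memory = 1 the reservation typically succeeds while consuming almost no RAM, whereas under vm.overcommit_memory = 2 it is refused with ENOMEM once it exceeds CommitLimit, which is exactly the failure mode that crashes a sparsely allocating JVM.

import mmap

# 64 GiB of address space that is never written to. The size is arbitrary;
# pick something larger than CommitLimit on the machine being tested.
SIZE = 64 * 1024 ** 3

try:
    # Private anonymous mapping: charged against the commit limit under
    # overcommit_memory = 2, but no physical pages are allocated until the
    # pages are first touched (page fault).
    region = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    print("reserved %d GiB of address space" % (SIZE // 1024 ** 3))
    region.close()
except (OSError, ValueError) as err:
    # Under overcommit_memory = 2 this is the ENOMEM case; under mode 0 the
    # heuristic may also refuse a single mapping larger than RAM + swap.
    print("mapping refused: %s" % err)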

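Similarly, the /proc/<pid>/oom_score_adj approach described above can be scripted. The sketch below is illustrative only and is not part of the thread: the process names ("postgres", "java") and the adjustment values are assumptions, and writing these files requires root. A positive value makes a process a preferred OOM-killer victim; a negative value protects it (the valid range is -1000 to 1000).

import os

def set_oom_score_adj(pid, adj):
    """Write an oom_score_adj value in [-1000, 1000] for the given PID."""
    with open("/proc/%d/oom_score_adj" % pid, "w") as f:
        f.write(str(adj))

def pids_with_comm(name):
    """Yield PIDs whose /proc/<pid>/comm matches the given process name."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    yield int(entry)
        except (IOError, OSError):
            continue  # process exited between listdir() and open()

if __name__ == "__main__":
    # Illustrative policy only: protect HAWQ/Postgres processes, prefer
    # killing Hadoop Java containers (the framework restarts failed
    # containers anyway). Values and names are assumptions, not a
    # recommendation from the thread.
    for pid in pids_with_comm("postgres"):
        set_oom_score_adj(pid, -500)
    for pid in pids_with_comm("java"):
        set_oom_score_adj(pid, 200)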