Hi Ruilong,

I've been brainstorming the issue, and this is my proposed solution. Please
tell me what you think.

Segments are stateless. In Greenplum, we worry about catalog corruption
when a segment dies. In HAWQ, all of the data nodes are stateless, so even
if the OOM killer ends up killing a segment, we shouldn't need to worry
about catalog corruption. *Only the master has a catalog that matters.*

My proposal:

Because the catalog matters on the master, we should probably continue to
run master nodes with vm.overcommit_memory = 2. On the segments, however, I
think we shouldn't worry so much about an OOM event. The problem remains
that all queries across the cluster will be canceled if a data node goes
offline (at least until HAWQ is able to restart failed query executors). If
we *really* want to prevent the segments from being killed, we could tell
the kernel to prefer killing the other processes on the node via the
/proc/<pid>/oom_score_adj facility (a rough sketch of this is below).
Because Hadoop processes are generally resilient enough to restart failed
containers, most Java processes can be treated as more expendable than HAWQ
processes.

/proc/<pid>/oom_score_adj ref:
https://www.kernel.org/doc/Documentation/filesystems/proc.txt
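
For illustration, a minimal sketch of the idea (Python, run as root on a
segment host). The "postgres" command name and the -500 value are only
assumptions for the example; a real script would need to identify the HAWQ
segment PIDs more carefully and choose the adjustment deliberately:

    #!/usr/bin/env python
    # Sketch: make HAWQ segment processes less attractive to the OOM killer
    # by lowering their oom_score_adj (range -1000..1000; lower = less
    # likely to be picked). Other processes keep the default score of 0, so
    # the kernel will prefer to kill the Hadoop/Java processes first.
    import os

    HAWQ_OOM_SCORE_ADJ = "-500"  # illustrative value, not a tuned recommendation

    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/comm" % pid) as f:
                comm = f.read().strip()
            if comm == "postgres":  # assumption: HAWQ segment processes
                with open("/proc/%s/oom_score_adj" % pid, "w") as f:
                    f.write(HAWQ_OOM_SCORE_ADJ)
        except (IOError, OSError):
            pass  # process exited or no permission; skip it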

Thanks,

Taylor Vesely

On Fri, Dec 16, 2016 at 7:01 AM, Ruilong Huo <[email protected]> wrote:

> Hi HAWQ Community,
>
> The overcommit_memory setting in Linux controls the behaviour of memory
> allocation. In a cluster running both HAWQ and Hadoop, the right
> overcommit_memory setting for the nodes is contentious: HAWQ recommends
> overcommit strategy 2, while Hadoop recommends 0 or 1.
>
> This thread is to start a discussion of the options, so that we can make a
> reasonable choice that works well for both products.
>
> *1. From the HAWQ perspective*
>
> It is recommended to use vm.overcommit_memory = 2 (rather than 0 or 1) to
> prevent random kills of HAWQ processes and the resulting backend resets.
>
> If the nodes of the cluster are set to overcommit_memory = 0 or 1, there
> is a risk that running queries get terminated due to a backend reset. Even
> worse, with overcommit_memory = 1, there is a chance that data files and
> transaction logs get corrupted due to insufficient cleanup during process
> exit when an OOM event happens. More details on the overcommit_memory
> setting in HAWQ can be found at: Linux-Overcommit-strategies-and-Pivotal-GPDB-HDB
> <https://discuss.pivotal.io/hc/en-us/articles/202703383-
> Linux-Overcommit-strategies-and-Pivotal-Greenplum-GPDB-Pivotal-HDB-HDB->
> .
>
> *2. From the Hadoop perspective*
>
> A datanode crash usually happens when there is not enough heap memory for
> the JVM. To be specific, the JVM tries to allocate more heap (via a malloc
> or mmap system call) after the address space has been exhausted. With
> overcommit_memory = 2, once we run out of available address space the
> system call returns ENOMEM, and the JVM crashes.
>
> This is because Java is very address-space greedy: it reserves large
> regions of address space that it isn't actually using. The
> overcommit_memory = 2 setting doesn't actually restrict physical memory
> use; it restricts address space use. Many applications (especially Java)
> allocate sparse pages of memory and rely on the kernel to actually provide
> the memory when a page fault occurs.
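>
> As an illustration of that last point: under overcommit_memory = 2 the
> kernel enforces a limit on committed address space, not on physical
> memory. The two numbers can be compared directly from /proc/meminfo (a
> rough Python sketch, nothing HAWQ- or Hadoop-specific):
>
>     # Under vm.overcommit_memory = 2, an allocation fails with ENOMEM once
>     # Committed_AS (address space promised to all processes) would exceed
>     # CommitLimit (swap + overcommit_ratio% of RAM), no matter how much
>     # physical memory is actually in use.
>     with open("/proc/meminfo") as f:
>         meminfo = dict(line.split(":", 1) for line in f)
>
>     print("CommitLimit:  %s" % meminfo["CommitLimit"].strip())
>     print("Committed_AS: %s" % meminfo["Committed_AS"].strip())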
>
> Best regards,
> Ruilong Huo
>
