[
https://issues.apache.org/jira/browse/HIVE-17684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606717#comment-16606717
]
Misha Dmitriev commented on HIVE-17684:
---------------------------------------
[~stakiar] I've also checked a number of logs, and it looks like all or most
of the failed tests indeed fail because a GC time >= 100 percent is reported.
I suspect this may be because the underlying
{{org.apache.hadoop.util.GcTimeMonitor}} class uses the JMX API (class
{{java.lang.management.GarbageCollectorMXBean}}) internally. I remember
reading that this API may sometimes wrongly report, for example, the time
spent in a concurrent object-marking GC phase (which doesn't pause the JVM)
alongside the normal pause time. If so, the percentages at or above 100 would
be explained by this incorrect accounting. It looks like this happens only
under heavy GC pressure - I didn't see it in benchmarks in normal mode or when
running these tests locally.
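For reference, below is a minimal, self-contained sketch (not the actual
GcTimeMonitor code) of how GC time is typically sampled through this JMX API;
the class and variable names are mine. If, as suspected above,
{{getCollectionTime()}} also counts concurrent phases for some collectors, a
percentage derived from it can come out at or above 100 under heavy GC:
{code:java}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: sample the accumulated GC time across all collectors via JMX and
// express the delta over an observation window as a percentage of wall-clock
// time. If getCollectionTime() also counts concurrent GC phases (as suspected
// above), this percentage can exceed 100 in heavy GC situations.
public class GcTimeSample {
  static long totalGcTimeMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime(); // -1 if undefined for this collector
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  public static void main(String[] args) throws InterruptedException {
    long gcBefore = totalGcTimeMillis();
    long windowStart = System.currentTimeMillis();
    Thread.sleep(1000); // observation window
    long gcDelta = totalGcTimeMillis() - gcBefore;
    long elapsed = System.currentTimeMillis() - windowStart;
    System.out.println("GC time in window: " + (100 * gcDelta / Math.max(elapsed, 1)) + "%");
  }
}
{code}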
I guess that in these circumstances the right solution would be to make
{{criticalGcTimePercentage}} a configurable variable in HiveConf and get rid
of the current {{CRITICAL_GC_TIME_PERCENTAGE_TEST/PROD}} constants. If
{{criticalGcTimePercentage}} can be configured separately for normal tests (to
something like 1000% to be on the safe side), for tests that are expected to
fail (to 0%), and for prod (probably to something like the current 50% plus
some safety margin), then hopefully things will work as expected in all cases.
But can it be configured in that way?
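As a rough illustration of what I have in mind (the property name
{{hive.spark.task.critical.gc.time.percentage}} is hypothetical, not an
existing HiveConf variable):
{code:java}
import org.apache.hadoop.hive.conf.HiveConf;

// Sketch only: the property name below is hypothetical. Normal tests could set
// it to something like 1000 (effectively disabling the check), tests that are
// expected to fail to 0, and prod to the current 50 plus a safety margin.
public class CriticalGcThresholdSketch {
  // HiveConf extends Hadoop's Configuration, so a plain getInt() works even
  // before a dedicated ConfVars entry is added.
  static int criticalGcTimePercentage(HiveConf conf) {
    return conf.getInt("hive.spark.task.critical.gc.time.percentage", 50);
  }

  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    System.out.println("criticalGcTimePercentage = " + criticalGcTimePercentage(conf));
  }
}
{code}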
Longer term, if we see evidence that the current implementation of
GcTimeMonitor miscalculates GC time considerably, we can switch to the
existing {{org.apache.hadoop.hive.common.JvmPauseMonitor}}, which uses a
different method of estimating GC time. That class could simply be extended to
report GC time as a percentage for our purposes.
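For completeness, here is a self-contained sketch of the pause-detection idea
that JvmPauseMonitor is based on, as I understand it - sleep for a fixed
interval and attribute any extra elapsed time to JVM pauses - extended here to
report pause time as a percentage (the real class keeps more detailed
bookkeeping and runs in a daemon thread):
{code:java}
// Sketch of the pause-detection approach (not the actual JvmPauseMonitor
// code): sleep for a fixed interval, treat any extra elapsed time as a JVM
// pause, and report accumulated pause time as a percentage of wall-clock time.
public class PausePercentageSketch {
  private static final long SLEEP_MS = 500;

  public static void main(String[] args) throws InterruptedException {
    final long monitorStart = System.currentTimeMillis();
    long totalPauseMs = 0;

    for (int i = 0; i < 20; i++) { // bounded loop just for the sketch
      long before = System.currentTimeMillis();
      Thread.sleep(SLEEP_MS);
      long extra = System.currentTimeMillis() - before - SLEEP_MS;
      if (extra > 0) {
        totalPauseMs += extra; // time the thread wasn't running, mostly GC pauses
      }
      long elapsed = System.currentTimeMillis() - monitorStart;
      System.out.println("Estimated pause time: "
          + (100 * totalPauseMs / Math.max(elapsed, 1)) + "%");
    }
  }
}
{code}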
> HoS memory issues with MapJoinMemoryExhaustionHandler
> -----------------------------------------------------
>
> Key: HIVE-17684
> URL: https://issues.apache.org/jira/browse/HIVE-17684
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Misha Dmitriev
> Priority: Major
> Attachments: HIVE-17684.01.patch, HIVE-17684.02.patch,
> HIVE-17684.03.patch, HIVE-17684.04.patch, HIVE-17684.05.patch,
> HIVE-17684.06.patch
>
>
> We have seen a number of memory issues due to the {{HashSinkOperator}}'s use
> of the {{MapJoinMemoryExhaustionHandler}}. This handler is meant to detect
> scenarios where the small table is taking up too much space in memory, in
> which case a {{MapJoinMemoryExhaustionError}} is thrown.
> The configs to control this logic are:
> {{hive.mapjoin.localtask.max.memory.usage}} (default 0.90)
> {{hive.mapjoin.followby.gby.localtask.max.memory.usage}} (default 0.55)
> The handler uses the {{MemoryMXBean}} and the following logic to estimate
> how much memory the {{HashMap}} is consuming:
> {{MemoryMXBean#getHeapMemoryUsage().getUsed() /
> MemoryMXBean#getHeapMemoryUsage().getMax()}}
> The issue is that {{MemoryMXBean#getHeapMemoryUsage().getUsed()}} can be
> inaccurate. The value it returns includes all reachable and unreachable
> objects on the heap, so there may be a lot of garbage data that the JVM
> simply hasn't taken the time to reclaim yet. This can lead to intermittent
> failures of this check even though a simple GC would have reclaimed enough
> space for the process to continue working.
> We should re-think the usage of {{MapJoinMemoryExhaustionHandler}} for HoS.
> In Hive-on-MR this probably made sense because every Hive task ran in a
> dedicated container, so a Hive task could assume it created most of the data
> on the heap. However, in Hive-on-Spark there can be multiple Hive tasks
> running in a single executor, each doing different things.
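For reference, a minimal sketch of the heap-usage check described in the issue
above (the class name and threshold handling are illustrative; the real logic
lives in {{MapJoinMemoryExhaustionHandler}}):
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Illustrative sketch of the used/max heap check described above. Note that
// getUsed() also counts unreachable-but-not-yet-collected objects, so this
// ratio can cross the threshold even when a GC would free plenty of space.
public class HeapFractionCheckSketch {
  public static void main(String[] args) {
    MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
    MemoryUsage heap = memoryMXBean.getHeapMemoryUsage();
    double maxMemoryUsage = 0.90; // cf. hive.mapjoin.localtask.max.memory.usage
    // getMax() is normally defined for the heap; getUsed() includes garbage.
    double usedFraction = (double) heap.getUsed() / heap.getMax();
    if (usedFraction > maxMemoryUsage) {
      // The real handler throws MapJoinMemoryExhaustionError at this point.
      System.err.println("Hash table memory exhaustion: "
          + usedFraction + " > " + maxMemoryUsage);
    }
  }
}
{code}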
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)