[
https://issues.apache.org/jira/browse/DRILL-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331778#comment-15331778
]
John Omernik commented on DRILL-4723:
-------------------------------------
[~adeneche] Do you want me to manually force a heap dump when the Allocation
Failures occur? At this point I don't think we are actually out of memory; the
Full GCs and the delays they cause are, I think, what is actually causing the
cluster stability issues. Sometimes the nodes do get to OOM conditions (I set
up the dump-on-OOM settings), but I wanted to be clear whether you want the
dump prior to OOM.
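(For reference, this is roughly how I would force a dump manually before OOM;
the pid and output path below are placeholders, and the dump-on-OOM behavior is
just the standard HotSpot flags appended to DRILL_JAVA_OPTS in drill-env.sh:)

  # force an immediate heap dump of a running drillbit
  jmap -dump:live,format=b,file=/tmp/drillbit-heap.hprof <drillbit-pid>
  # have the JVM dump automatically if it actually hits OOM
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp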
In addition, I lowered my max width per node from 40 to 20, and things worked
with the 1024m block size. I still had some Full GC issues, but not nearly as
many, and the query completed.
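(These are the session options I mean -- I believe these are the right option
names in 1.6, and the values are just the ones from this test:)

  ALTER SESSION SET `planner.width.max_per_node` = 20;
  -- 1024m expressed in bytes
  ALTER SESSION SET `store.parquet.block-size` = 1073741824;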
Also, any thoughts on why only 2 out of the 5 nodes are actually doing "writes"
to parquet would be welcome.
I would still love to see the Parquet writer use Direct Memory; this is a
challenge to work my way through :)
> GC Memory Allocation Issues
> ---------------------------
>
> Key: DRILL-4723
> URL: https://issues.apache.org/jira/browse/DRILL-4723
> Project: Apache Drill
> Issue Type: Bug
> Components: Server
> Affects Versions: 1.6.0
> Environment: 5 Drill bits, MapR 1.6 release. 84GB of Direct Memory,
> 24GB of Heap
> Reporter: John Omernik
> Labels: memory, stability
> Fix For: Future
>
> Attachments: node3.log
>
>
> This issue is reposted from the Drill User List. More logs are available on
> request in the comments (please specify which logs you want).
> High level: Drill Cluster gets into a bad state when this occurs.
> This is what I have thus far... I can provide more complete logs on a
> one-on-one basis.
> The cluster was completely mine, with fresh logs. I ran a CTAS query on a
> large table with over 100 fields. This query works well in other cases;
> however, I was experimenting with the block size, both in MapR-FS and Drill
> Parquet. I had successfully tested 512m for each, but this case was
> different. Here are the facts in this setup:
> - No Compression in MapRFS - Using Standard Parquet Snappy Compression
> - MapR Block Size 1024m
> - Parquet Block size 1024m
> - The query ends up disappearing from the profiles
> - The UI page listing bits only shows 4 bits even though 5 are running (the
> node 03 process is running, but no longer shows up as a bit)
> - Error (copied below) from rest API
> - No output in stdout or stderr on node3. Only two nodes actually had
> "Parquet Writing" logs; the other three did not show any issues/errors in
> stdout
> - I have standard log files, gclogs, the profile.json (before it
> disappeared), and the physical plan. Only the components that looked
> possibly at issue are included here
> - The Node 3 GC log shows a bunch of "Full GC (Allocation Failure)" pauses
> that take 4 seconds or more (when I've seen this in other cases, the pause
> time can balloon to 8 secs or more); the heap and GC settings behind these
> logs are sketched right after this list
> - The node 3 output log shows some really long RPC durations. Perhaps the
> GCs prevent RPC communication and create a snowball effect?
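> (For context, the memory configuration from the Environment field would
> correspond to something like the following in conf/drill-env.sh. The heap and
> direct-memory values are the ones from this cluster; the GC-logging flags and
> the log path are my assumptions about a typical setup, not the exact ones in
> use:)
>
>   DRILL_HEAP="24G"
>   DRILL_MAX_DIRECT_MEMORY="84G"
>   # G1 (as seen in the gclog excerpts below) plus GC logging
>   export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -XX:+UseG1GC -XX:+PrintGCDetails \
>     -XX:+PrintGCTimeStamps -Xloggc:/var/log/drill/drillbit.gc.log"
>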
> Other logs can be provided upon request if people are interested; I just
> didn't want to flood the whole list with all of them.
> Thanks!
> John
> Rest Error:
> ./load_day.py 2016-05-09
> Drill Rest Endpoint: https://drillprod.marathonprod.zeta:20000
> Day: 2016-05-09
> /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769:
> InsecureRequestWarning: Unverified HTTPS request is being made. Adding
> certificate verification is strongly advised. See:
> https://urllib3.readthedocs.org/en/latest/security.html
> InsecureRequestWarning)
> Authentication successful
> Error encountered: 500
> {
> "errorMessage" : "SYSTEM ERROR: ForemanException: One more more nodes lost
> connectivity during query. Identified nodes were
> [atl1ctuzeta03:20001].\n\n\n[Error Id: d7dd0120-f7c0-44ef-ac54-29c746b26354
> on atl1ctuzeta01:20001"
> }
> Possible issue in Node3 Log:
> 2016-06-14 17:25:27,860 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:90] INFO
> o.a.d.e.w.f.FragmentStatusReporter -
> 289fc208-7266-6a81-73a1-709efff6c412:1:90: State to report: RUNNING
> 2016-06-14 17:25:27,871 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:70] INFO
> o.a.d.e.w.fragment.FragmentExecutor -
> 289fc208-7266-6a81-73a1-709efff6c412:1:70: State change requested
> AWAITING_ALLOCATION --> RUNNING
> 2016-06-14 17:25:27,871 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:70] INFO
> o.a.d.e.w.f.FragmentStatusReporter -
> 289fc208-7266-6a81-73a1-709efff6c412:1:70: State to report: RUNNING
> 2016-06-14 17:43:41,869 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1
> took longer than 500ms. Actual duration was 4192ms.
> 2016-06-14 17:45:36,720 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.fragment.FragmentExecutor -
> 289fc208-7266-6a81-73a1-709efff6c412:1:0: State change requested RUNNING -->
> CANCELLATION_REQUESTED
> 2016-06-14 17:45:45,740 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.f.FragmentStatusReporter -
> 289fc208-7266-6a81-73a1-709efff6c412:1:0: State to report:
> CANCELLATION_REQUESTED
> 2016-06-14 17:46:15,318 [BitServer-3] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6
> took longer than 500ms. Actual duration was 55328ms.
> 2016-06-14 17:46:36,057 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.fragment.FragmentExecutor -
> 289fc208-7266-6a81-73a1-709efff6c412:1:5: State change requested RUNNING -->
> CANCELLATION_REQUESTED
> 2016-06-14 17:46:44,620 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.f.FragmentStatusReporter -
> 289fc208-7266-6a81-73a1-709efff6c412:1:5: State to report:
> CANCELLATION_REQUESTED
> 2016-06-14 17:47:01,393 [BitServer-3] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6
> took longer than 500ms. Actual duration was 29781ms.
> 2016-06-14 17:47:09,463 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.fragment.FragmentExecutor -
> 289fc208-7266-6a81-73a1-709efff6c412:1:10: State change requested RUNNING -->
> CANCELLATION_REQUESTED
> 2016-06-14 17:47:26,967 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.f.FragmentStatusReporter -
> 289fc208-7266-6a81-73a1-709efff6c412:1:10: State to report:
> CANCELLATION_REQUESTED
> 2016-06-14 17:47:55,593 [BitServer-3] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6
> took longer than 500ms. Actual duration was 46130ms.
> 2016-06-14 17:48:04,497 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.fragment.FragmentExecutor -
> 289fc208-7266-6a81-73a1-709efff6c412:1:15: State change requested RUNNING -->
> CANCELLATION_REQUESTED
> 2016-06-14 17:48:12,742 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1
> took longer than 500ms. Actual duration was 4236ms.
> 2016-06-14 17:48:42,328 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.f.FragmentStatusReporter -
> 289fc208-7266-6a81-73a1-709efff6c412:1:15: State to report:
> CANCELLATION_REQUESTED
> 2016-06-14 17:49:36,351 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1
> took longer than 500ms. Actual duration was 4260ms.
> 2016-06-14 17:49:36,351 [BitServer-3] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6
> took longer than 500ms. Actual duration was 91854ms.
> 2016-06-14 17:50:35,273 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.fragment.FragmentExecutor -
> 289fc208-7266-6a81-73a1-709efff6c412:1:20: State change requested RUNNING -->
> CANCELLATION_REQUESTED
> 2016-06-14 17:50:39,322 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:15] INFO
> o.a.d.e.w.fragment.FragmentExecutor -
> 289fc208-7266-6a81-73a1-709efff6c412:1:15: State change requested
> CANCELLATION_REQUESTED --> FAILED
> 2016-06-14 17:50:51,546 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.f.FragmentStatusReporter -
> 289fc208-7266-6a81-73a1-709efff6c412:1:20: State to report:
> CANCELLATION_REQUESTED
> 2016-06-14 17:51:36,905 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1
> took longer than 500ms. Actual duration was 4426ms.
> 2016-06-14 17:52:30,805 [BitServer-3] WARN
> o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6
> took longer than 500ms. Actual duration was 98767ms.
> 2016-06-14 17:52:47,042 [BitServer-4] WARN
> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1
> took longer than 500ms. Actual duration was 12041ms.
> Possible issues in Node3 gclog:
> 1819.137: [Full GC (Allocation Failure) 23G->23G(24G), 4.0657064 secs]
> 1823.205: [GC concurrent-mark-abort]
> 1823.221: [GC pause (G1 Evacuation Pause) (young)-- 23G->23G(24G), 0.0382934
> secs]
> 1823.271: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
> 23G->23G(24G), 0.0250215 secs]
> 1823.296: [GC concurrent-root-region-scan-start]
> 1823.296: [GC concurrent-root-region-scan-end, 0.0000105 secs]
> 1823.296: [GC concurrent-mark-start]
> 1823.308: [Full GC (Allocation Failure) 23G->23G(24G), 4.3719713 secs]
> 1827.694: [GC pause (G1 Evacuation Pause) (young)-- 23G->23G(24G), 0.0312666
> secs]
> 1827.727: [GC concurrent-mark-abort]
> 1827.735: [GC pause (G1 Evacuation Pause) (young) 23G->23G(24G), 0.0212359
> secs]
> 1827.766: [Full GC (Allocation Failure) 23G->23G(24G), 3.9308980 secs]
> 1831.710: [GC pause (G1 Evacuation Pause) (young)-- 23G->23G(24G), 0.0298229
> secs]
> 1831.750: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
> 23G->23G(24G), 0.0268410 secs]
> 1831.777: [GC concurrent-root-region-scan-start]
> 1831.777: [GC concurrent-root-region-scan-end, 0.0000212 secs]
> 1831.777: [GC concurrent-mark-start]
> 1831.789: [Full GC (Allocation Failure) 23G->23G(24G), 3.9250410 secs]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)