Hi Fangmin,

Very hard to say for sure what's causing this, but since you are seeing high system times, something outside the JVM might be causing the long pauses. Since you ruled out swapping, another possible cause is Transparent Hugepages; there is a short section about this here:
https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#GUID-8D9B2530-E370-4B8B-8ADD-A43674FC6658

I suggest that you check whether transparent hugepages are configured on your system. If they are, turn them off and re-run your test to see if it helps.
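
In case it is useful, here is a minimal sketch (assuming the standard Linux sysfs paths, which may differ on your distribution) that prints the current THP settings; reading the same files from a shell works just as well. The active mode is the bracketed entry, e.g. "always madvise [never]".

    // Minimal sketch: print the current Transparent Hugepage settings.
    // Assumes the standard Linux sysfs paths.
    #include <fstream>
    #include <iostream>
    #include <string>

    static void print_setting(const char* path) {
        std::ifstream f(path);
        std::string line;
        if (f && std::getline(f, line))
            std::cout << path << ": " << line << '\n';  // active mode shown in brackets
        else
            std::cout << path << ": not available\n";
    }

    int main() {
        print_setting("/sys/kernel/mm/transparent_hugepage/enabled");
        print_setting("/sys/kernel/mm/transparent_hugepage/defrag");
        return 0;
    }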

Regarding your finding around the stealing, there could of course be something problematic there as well, but before going down that path I would like to rule out transparent hugepages first.

Thanks,
Stefan

On 2020-03-08 19:17, Fangmin Lv wrote:
Hi guys,

I need a GC expert to help me understand some strange G1GC behavior. It has been causing trouble for us for a while, and I've spent more than a week investigating it without fully addressing it. I would really appreciate any help here.

Our code is running on OpenJDK 11 with the following GC settings:


    -Xms144179m -Xmx144179m -XX:+AlwaysPreTouch -XX:+UseG1GC
    -XX:MaxGCPauseMillis=500 -Xlog:gc*=debug:file=/tmp/zeus.gc.log

It's running on a machine with 256GB of memory and 80 processors.

There are long GC pauses (> 2s) from time to time during young GC, and most of the time is spent in "Evacuating Collection Set". After enabling the GC debug log, I found most of the time was spent in the "Object Copy" phase:

    ...
    [74835.843s][debug  ][gc,phases         ] GC(43)     Object Copy
    (ms):         Min: 2271.3, Avg: 2287.8, Max: 2401.6, Diff: 130.2,
    Sum: 86935.4, Workers: 38
    [74835.843s][debug  ][gc,phases         ] GC(43)     Termination
    (ms):         Min:  3.7, Avg:  4.2, Max:  5.0, Diff:  1.3, Sum:
    160.7, Workers: 38
    [74835.843s][debug  ][gc,phases         ] GC(43)       Termination
    Attempts:     Min: 3523, Avg: 3654.5, Max: 3782, Diff: 259, Sum:
    138872, Workers: 38
    ...
    [74835.843s][info      ][gc,heap             ] GC(43) Eden regions:
    1261->0(1258)
    [74835.843s][info      ][gc,heap             ] GC(43) Survivor
    regions: 90->93(169)
    [74835.843s][info      ][gc,heap             ] GC(43) Old regions:
    266->266
    [74835.843s][info      ][gc,heap             ] GC(43) Humongous
    regions: 7->7
    [74835.843s][info      ][gc,metaspace   ] GC(43) Metaspace:
    46892K->46892K(49152K)
    [74835.843s][debug  ][gc,heap            ] GC(43) Heap after GC
    invocations=44 (full 0): garbage-first heap   total 147652608K, used
    11987833K [0x00007f0c34000000, 0x00007f2f68000000)
    [74835.843s][debug  ][gc,heap            ] GC(43)   region size
    32768K, 93 young (3047424K), 93 survivors (3047424K)
    [74835.843s][debug  ][gc,heap            ] GC(43)  Metaspace used
    46892K, capacity 47403K, committed 47712K, reserved 49152K
    [74835.843s][info      ][gc                     ] GC(43) Pause Young
    (Normal) (G1 Evacuation Pause) 51966M->11706M(144192M) 2439.941ms
    [74835.843s][info      ][gc,cpu              ] GC(43) User=14.54s
    Sys=1.86s Real=2.44s


I looked into the JDK docs and other blogs shared online to understand G1GC, and tried various settings like reducing the minimum young gen size with -XX:G1NewSizePercent=1, setting -XX:ParallelGCThreads=64, tuning the PLAB size, etc., but they didn't address the issue.

Some blogs mention that memory swapping or disk IO might cause a long Object Copy, but there is plenty of memory in the system, and we didn't see swap kick in from atop.

Since I couldn't find the answer online, I tried reading the code and adding instrumentation logs to understand the GC behavior (you can find the changed code in the attachment), and found that the time is spent in a long loop of stealing work from other task queues; the full log is attached.

No single GC thread got caught in root scanning or CPU starvation; all of them finished trim_queue within a few ms and then entered the long steal_and_trim_queue loop. G1ParScanThreadState::do_oop_partial_array handles arrays specially: if an object array is bigger than ParGCArrayScanChunk it only scans a subset and pushes the remainder back as a new task, and we do sometimes have (temporary) array objects larger than 500k. I thought this might be the issue, but we still see the long stealing behavior even after raising this value from 50 to 500, and there are cases with no large array that still show the problem.
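
To make the chunking behavior concrete, here is a rough standalone sketch of the idea as I understand it (simplified and single-threaded, not the actual HotSpot code; the names are made up):

    // Simplified model of do_oop_partial_array-style chunked array scanning:
    // if an array task covers more than kArrayScanChunk elements, scan only
    // one chunk and push the remainder back as a new task, which another
    // worker may then steal.
    #include <cstddef>
    #include <deque>
    #include <iostream>

    constexpr std::size_t kArrayScanChunk = 50;  // stand-in for -XX:ParGCArrayScanChunk

    struct ArrayTask {
        std::size_t from;     // first element still to scan
        std::size_t length;   // total array length
    };

    void scan_array_chunk(ArrayTask task, std::deque<ArrayTask>& queue) {
        std::size_t end = task.length;
        if (end - task.from > kArrayScanChunk) {
            end = task.from + kArrayScanChunk;
            queue.push_back({end, task.length});  // remainder becomes a new task
        }
        for (std::size_t i = task.from; i < end; ++i) {
            // scan element i (in the real code: evacuate the referenced object)
        }
    }

    int main() {
        std::deque<ArrayTask> queue;
        queue.push_back({0, 500000});             // one large temporary array
        std::size_t tasks = 0;
        while (!queue.empty()) {
            ArrayTask t = queue.front();
            queue.pop_front();
            scan_array_chunk(t, queue);
            ++tasks;
        }
        std::cout << "tasks generated for one array: " << tasks << '\n';
    }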

After adding more logs, I saw that all the GC threads appeared to steal the same task, which was then stolen again by another thread. The tasks should actually be different ones, but the behavior seems weird:

    [24632.803s][debug  ][gc,ergo           ] GC(11) worker 1 stealed
    from queue 32, its size was 1
    [24632.803s][debug  ][gc,ergo           ] GC(11) worker 28 stealed
    from queue 1, its size was 2
    [24632.803s][debug  ][gc,ergo           ] GC(11) worker 24 stealed
    from queue 28, its size was 1
    [24632.803s][debug  ][gc,ergo           ] GC(11) worker 4 stealed
    from queue 24, its size was 1
    [24632.804s][debug  ][gc,ergo           ] GC(11) worker 5 stealed
    from queue 4, its size was 1
    [24632.804s][debug  ][gc,ergo           ] GC(11) worker 15 stealed
    from queue 5, its size was 1
    [24632.804s][debug  ][gc,ergo           ] GC(11) worker 37 stealed
    from queue 15, its size was 1
    [24632.804s][debug  ][gc,ergo           ] GC(11) worker 16 stealed
    from queue 37, its size was 1
    [24632.804s][debug  ][gc,ergo           ] GC(11) worker 7 stealed
    from queue 16, its size was 1
    [24632.804s][debug  ][gc,ergo           ] GC(11) worker 2 stealed
    from queue 7, its size was 1
    [24632.804s][debug  ][gc,ergo           ] GC(11) worker 29 stealed
    from queue 2, its size was 1
    [24632.804s][debug  ][gc,ergo           ] GC(11) worker 15 stealed
    from queue 29, its size was 1

I tried to follow the code further, and found that G1ScanEvacuatedObjClosure may push an obj onto the task queue after copying it to the survivor space, but I can't work out why this could keep happening for a few thousand steals. Please give me some insight into why this happens; are there any known scenarios that could trigger it?
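
As a purely hypothetical illustration of the copy-then-push pattern I mean (a toy single-threaded model, not HotSpot code): if each copied object pushes only a single reference field as a new task, a queue never holds more than one entry at a time, so there is always exactly one task around to steal:

    // Toy model: each "evacuated" object pushes its single reference field
    // as a new task. With a long chain of such objects the queue depth never
    // exceeds 1. Purely illustrative; names and structure are made up.
    #include <algorithm>
    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <memory>
    #include <vector>

    struct Obj {
        Obj* next = nullptr;   // single reference field
        bool copied = false;
    };

    int main() {
        const int kChainLen = 10000;
        std::vector<std::unique_ptr<Obj>> heap;
        for (int i = 0; i < kChainLen; ++i)
            heap.push_back(std::make_unique<Obj>());
        for (int i = 0; i + 1 < kChainLen; ++i)
            heap[i]->next = heap[i + 1].get();   // o0 -> o1 -> ... -> oN

        std::deque<Obj*> queue;                  // one worker's task queue
        queue.push_back(heap[0].get());          // root reference
        std::size_t max_depth = 0;
        while (!queue.empty()) {
            max_depth = std::max(max_depth, queue.size());
            Obj* o = queue.front();
            queue.pop_front();
            o->copied = true;                    // "evacuate" the object
            if (o->next && !o->next->copied)
                queue.push_back(o->next);        // push its reference field
        }
        std::cout << "max queue depth: " << max_depth << '\n';   // prints 1
    }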

One thing I'm trying is to cap the steal behavior: skip stealing if the victim task queue has fewer than 5 tasks. This might reduce the contention and CPU cost from GC threads busily stealing, but I'm not sure whether it can fully address the issue here; I'll post an update based on the testing.
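
For reference, this is roughly the shape of the guard I have in mind, as a standalone sketch with made-up names (the real change would have to go through the lock-free taskqueue code; the threshold of 5 is just the value I'm experimenting with):

    // Rough sketch of the capped-steal idea: only steal from a victim queue
    // that holds at least kStealThreshold tasks, so workers stop chasing a
    // single task around. Illustrative only.
    #include <cstddef>
    #include <deque>
    #include <optional>
    #include <vector>

    constexpr std::size_t kStealThreshold = 5;

    struct Task { int id; };
    using Queue = std::deque<Task>;

    std::optional<Task> try_steal(std::vector<Queue>& queues, std::size_t self) {
        for (std::size_t v = 0; v < queues.size(); ++v) {
            if (v == self || queues[v].size() < kStealThreshold)
                continue;                          // skip self and near-empty victims
            Task t = queues[v].front();
            queues[v].pop_front();
            return t;
        }
        return std::nullopt;                       // nothing worth stealing; try to terminate
    }

    int main() {
        std::vector<Queue> queues(4);
        queues[2] = {{1}, {2}, {3}, {4}, {5}, {6}}; // only queue 2 has enough work
        return try_steal(queues, 0) ? 0 : 1;
    }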

Thanks again for taking the time to go through all this; I really appreciate any help here!

Thanks,
Fangmin

_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use@openjdk.java.net
https://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
