Igniters,
At work I often have to solve performance issues with an Ignite cluster
without having access to the source code of the running user application.
It looks like Ignite has limited capabilities for identifying bottlenecks
without extensive profiling on the server and client side (JFR recordings,
sampling profilers, regular thread dumps, etc.), which is not always possible.
Even when profiling data is available, it is not always helpful for
determining several types of bottlenecks, for example, contention on a
single key/partition.
I propose to implement a new feature: lightweight profiling of message
processing.
The feature will provide a view of message processing statistics for each
node and for the grid as a whole.
In short, it's necessary to track each message's execution in the executors
and record statistics such as synchronous execution time in the executor
thread and waiting time in the queue.
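The tracking described above could be sketched as a thin wrapper around each
executor task that captures the enqueue timestamp, the queue wait, and the
synchronous processing time. This is only an illustration under my own naming
(TimedMessageTask, totals), not the actual Ignite internals; a real
implementation would feed the two durations into per-type histograms instead
of a plain counter:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch, not Ignite code: wrap each message task so the executor
// records queue waiting time and synchronous processing time per message type.
final class TimedMessageTask implements Runnable {
    /** Per-message-type processed counter (a real impl would keep histograms). */
    static final Map<String, LongAdder> totals = new ConcurrentHashMap<>();

    private final Runnable delegate;
    private final String msgType;
    private final long enqueueNanos = System.nanoTime(); // captured on enqueue

    TimedMessageTask(String msgType, Runnable delegate) {
        this.msgType = msgType;
        this.delegate = delegate;
    }

    @Override public void run() {
        long start = System.nanoTime();
        long waitNanos = start - enqueueNanos; // time spent waiting in the queue

        try {
            delegate.run();
        }
        finally {
            long procNanos = System.nanoTime() - start; // synchronous execution time

            // Here waitNanos/procNanos would be recorded into time-bin histograms.
            totals.computeIfAbsent(msgType, t -> new LongAdder()).increment();
        }
    }
}
```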
Full description:
1. Implement detailed tracking of message queue waiting and of actual
processing by executors, split into several time bins. Example of detailed
statistics for each processed message:
Processing time (message counts per time bin)

Message                                  | Total      | Average(ms) | < 1 ms     | < 10 ms  | < 30 ms | < 50 ms | < 100 ms | < 250 ms | < 500 ms | < 750 ms | < 1000 ms | > 1000 ms
-----------------------------------------+------------+-------------+------------+----------+---------+---------+----------+----------+----------+----------+-----------+----------
GridNearSingleGetRequest                 | 904311372  | 0.023000    | 904240521  | 57394    | 7242    | 3961    | 1932     | 229      | 61       | 24       | 4         | 4
GridNearSingleGetResponse                | 340134416  | 0.041000    | 340118791  | 11660    | 1167    | 729     | 901      | 1001     | 158      | 8        | 1         | 0
GridNearLockRequest                      | 77088689   | 0.079000    | 77073458   | 11945    | 2299    | 643     | 311      | 31       | 2        | 0        | 0         | 0
GridNearAtomicSingleUpdateInvokeRequest  | 39645752   | 0.298000    | 39580914   | 28222    | 6469    | 4638    | 9870     | 13414    | 2087     | 137      | 1         | 0
GridDhtAtomicSingleUpdateRequest         | 37636829   | 0.277000    | 37579375   | 23247    | 5915    | 4210    | 8954     | 12917    | 2048     | 162      | 1         | 0
GridDhtAtomicDeferredUpdateResponse      | 33580198   | 0.002000    | 33579805   | 337      | 51      | 3       | 1        | 1        | 0        | 0        | 0         | 0
GridNearTxPrepareRequest                 | 21667790   | 0.238000    | 21078069   | 580126   | 1622    | 1261    | 2531     | 3631     | 496      | 40       | 0         | 14
GridDhtTxPrepareResponse                 | 20949873   | 0.316000    | 17130161   | 3803105  | 4615    | 3162    | 4489     | 3721     | 577      | 34       | 1         | 8
GridDhtTxPrepareRequest                  | 20949838   | 0.501000    | 16158732   | 4750217  | 16183   | 5735    | 8472     | 8994     | 1353     | 88       | 11        | 53
GridDhtTxFinishResponse                  | 13835065   | 0.007000    | 13832519   | 2476     | 27      | 28      | 14       | 1        | 0        | 0        | 0         | 0
GridDhtTxFinishRequest                   | 13835028   | 0.547000    | 12084106   | 1736789  | 8971    | 2340    | 1792     | 807      | 118      | 41       | 4         | 60
GridNearTxFinishRequest                  | 13762197   | 0.725000    | 11811828   | 1942499  | 4441    | 1400    | 1201     | 524      | 89       | 34       | 19        | 162
GridDhtAtomicNearResponse                | 2784422    | 0.122000    | 2783393    | 1022     | 5       | 2       | 0        | 0        | 0        | 0        | 0         | 0
GridNearGetRequest                       | 2360483    | 0.484000    | 2345937    | 14129    | 244     | 101     | 64       | 8        | 0        | 0        | 0         | 0
GridNearGetResponse                      | 1984243    | 0.054000    | 1981905    | 2327     | 8       | 1       | 1        | 1        | 0        | 0        | 0         | 0
GridNearTxPrepareResponse                | 192856     | 0.153000    | 192660     | 188      | 1       | 5       | 1        | 1        | 0        | 0        | 0         | 0
GridNearLockResponse                     | 192780     | 0.091000    | 192667     | 107      | 3       | 0       | 3        | 0        | 0        | 0        | 0         | 0
GridNearTxFinishResponse                 | 177        | 0.822000    | 129        | 47       | 1       | 0       | 0        | 0        | 0        | 0        | 0         | 0
GridNearAtomicSingleUpdateRequest        | 124        | 4.803000    | 52         | 53       | 19      | 0       | 0        | 0        | 0        | 0        | 0         | 0
GridNearAtomicUpdateResponse             | 120        | 0.448000    | 110        | 10       | 0       | 0       | 0        | 0        | 0        | 0        | 0         | 0
-----------------------------------------+------------+-------------+------------+----------+---------+---------+----------+----------+----------+----------+-----------+----------
TOTAL                                    | 1544912252 |             | 1531765132 | 12965900 | 59283   | 28219   | 40537    | 45281    | 6989     | 568      | 42        | 301
Queue waiting time (message counts per time bin)

Message                                  | Total      | Average(ms) | < 1 ms     | < 10 ms  | < 30 ms | < 50 ms | < 100 ms | < 250 ms | < 500 ms | < 750 ms | < 1000 ms | > 1000 ms
-----------------------------------------+------------+-------------+------------+----------+---------+---------+----------+----------+----------+----------+-----------+----------
GridNearSingleGetRequest                 | 904311372  | 0.023000    | 903130400  | 1051499  | 33248   | 18049   | 34044    | 38809    | 4714     | 212      | 8         | 389
GridNearSingleGetResponse                | 340134416  | 0.041000    | 339708055  | 419479   | 3732    | 1004    | 1222     | 737      | 117      | 2        | 1         | 67
GridNearLockRequest                      | 77088689   | 0.079000    | 76949792   | 129666   | 2003    | 1624    | 2585     | 2531     | 390      | 18       | 2         | 78
GridNearAtomicSingleUpdateInvokeRequest  | 39645752   | 0.298000    | 39572801   | 53913    | 3949    | 2960    | 5487     | 5583     | 963      | 45       | 5         | 46
GridDhtAtomicSingleUpdateRequest         | 37636829   | 0.277000    | 37561043   | 57294    | 3721    | 2736    | 5339     | 5716     | 879      | 54       | 6         | 41
GridDhtAtomicDeferredUpdateResponse      | 33580198   | 0.002000    | 33512734   | 45873    | 4427    | 3340    | 6342     | 6363     | 1033     | 50       | 10        | 26
GridNearTxPrepareRequest                 | 21667790   | 0.238000    | 21584416   | 76283    | 1681    | 1026    | 1948     | 2027     | 330      | 31       | 2         | 46
GridDhtTxPrepareResponse                 | 20949873   | 0.316000    | 20862632   | 83578    | 1214    | 633     | 821      | 817      | 118      | 26       | 8         | 26
GridDhtTxPrepareRequest                  | 20949838   | 0.501000    | 20867551   | 72264    | 2265    | 1516    | 2773     | 2905     | 479      | 38       | 6         | 41
GridDhtTxFinishResponse                  | 13835065   | 0.007000    | 13808274   | 26487    | 150     | 68      | 55       | 21       | 3        | 1        | 0         | 6
GridDhtTxFinishRequest                   | 13835028   | 0.547000    | 13809380   | 25142    | 188     | 127     | 102      | 59       | 9        | 2        | 4         | 15
GridNearTxFinishRequest                  | 13762197   | 0.725000    | 13743925   | 17365    | 241     | 171     | 278      | 191      | 16       | 4        | 0         | 6
GridDhtAtomicNearResponse                | 2784422    | 0.122000    | 2780600    | 3767     | 29      | 9       | 7        | 9        | 1        | 0        | 0         | 0
GridNearGetRequest                       | 2360483    | 0.484000    | 2355963    | 3762     | 194     | 114     | 184      | 228      | 38       | 0        | 0         | 0
GridNearGetResponse                      | 1984243    | 0.054000    | 1979404    | 4721     | 31      | 9       | 49       | 28       | 1        | 0        | 0         | 0
GridNearTxPrepareResponse                | 192856     | 0.153000    | 192852     | 2        | 0       | 1       | 1        | 0        | 0        | 0        | 0         | 0
GridNearLockResponse                     | 192780     | 0.091000    | 192777     | 3        | 0       | 0       | 0        | 0        | 0        | 0        | 0         | 0
GridNearTxFinishResponse                 | 177        | 0.822000    | 177        | 0        | 0       | 0       | 0        | 0        | 0        | 0        | 0         | 0
GridNearAtomicSingleUpdateRequest        | 124        | 4.803000    | 124        | 0        | 0       | 0       | 0        | 0        | 0        | 0        | 0         | 0
GridNearAtomicUpdateResponse             | 120        | 0.448000    | 119        | 1        | 0       | 0       | 0        | 0        | 0        | 0        | 0         | 0
-----------------------------------------+------------+-------------+------------+----------+---------+---------+----------+----------+----------+----------+-----------+----------
TOTAL                                    | 1544912252 |             | 1542613019 | 2071099  | 57073   | 33387   | 61237    | 66024    | 9091     | 483      | 52        | 787
This will make it possible to understand the cluster load profile and to
detect problems with increased latency in message processing. In general,
all messages should be processed in under 30 ms.
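The time bins in the tables above could be maintained with a simple
fixed-bucket latency histogram. Below is a minimal sketch under my own naming
(TimeBinHistogram); the bucket bounds are taken from the table headers:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch: a fixed-bin latency histogram matching the bins above
// (< 1, < 10, < 30, < 50, < 100, < 250, < 500, < 750, < 1000, > 1000 ms).
final class TimeBinHistogram {
    /** Upper bounds of the first nine bins, in milliseconds. */
    private static final long[] BOUNDS_MS = {1, 10, 30, 50, 100, 250, 500, 750, 1000};

    /** One counter per bin; the last bin is the "> 1000 ms" overflow. */
    private final AtomicLongArray bins = new AtomicLongArray(BOUNDS_MS.length + 1);

    /** Records one observed duration into the matching bin. */
    void record(double durationMs) {
        bins.incrementAndGet(binIndex(durationMs));
    }

    /** Maps a duration to its bin index: 0 for < 1 ms, ..., 9 for > 1000 ms. */
    static int binIndex(double durationMs) {
        for (int i = 0; i < BOUNDS_MS.length; i++)
            if (durationMs < BOUNDS_MS[i])
                return i;

        return BOUNDS_MS.length; // overflow bin: > 1000 ms
    }

    /** Current count in the given bin. */
    long count(int bin) {
        return bins.get(bin);
    }
}
```

Atomic per-bin counters keep the hot path to one array increment per message,
which fits the "as lightweight as possible" requirement.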
2. Log messages which wait too long in the queue or take too long to
process (against a configurable threshold).
Track the following: message enqueue time, waiting time in the queue before
processing started, the message at the head of the queue when waiting
started, the queue size when waiting started, total waiting time, total
processing time, queue size after processing, and the message content.
Example:
Slow message: *enqueueTs*=2018-11-27 15:10:22.241, *waitTime*=0.048,
*procTime*=305.186, *messageId*=3a3064a9, *queueSzBefore*=0,
*headMessageId*=null,
*queueSzAfter*=0, *message*=GridNearTxFinishRequest [miniId=1,
mvccSnapshot=null, super=GridDistributedTxFinishRequest
[topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0],
futId=199a3155761-f379f312-ad4b-4181-acc5-0aacb3391f07, threadId=296,
commitVer=null, invalidate=false, commit=true, baseVer=null, txSize=0,
sys=false, plc=2, subjId=dda703a0-69ee-47cf-9b9a-bf3dc9309feb,
taskNameHash=0, flags=32, syncMode=FULL_SYNC, txState=IgniteTxStateImpl
[activeCacheIds=[644280847], recovery=false, mvccEnabled=false,
txMap=HashSet [IgniteTxEntry [key=KeyCacheObjectImpl [part=8, val=8,
hasValBytes=true], cacheId=644280847, txKey=IgniteTxKey
[key=KeyCacheObjectImpl [part=8, val=8, hasValBytes=true],
cacheId=644280847], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=false,
entry=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8,
hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0,
nodeOrder=0], hash=8, extras=GridCacheObsoleteEntryExtras
[obsoleteVer=GridCacheVersion [topVer=2147483647, order=0, nodeOrder=0]],
flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
[rdrs=ReaderId[] [], part=8, super=], prepared=0, locked=false,
nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
[topVer=154800617, order=1543320621588, nodeOrder=1]], IgniteTxEntry
[key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true],
cacheId=644280847, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=11,
val=11, hasValBytes=true], cacheId=644280847], val=[op=READ, val=null],
prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [],
filtersPassed=false, filtersSet=false, entry=GridCacheMapEntry
[key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true], val=null,
ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=11,
extras=GridCacheObsoleteEntryExtras [obsoleteVer=GridCacheVersion
[topVer=2147483647, order=0, nodeOrder=0]],
flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
[rdrs=ReaderId[] [], part=11, super=], prepared=0, locked=false,
nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
[topVer=154800617, order=1543320621588, nodeOrder=1]]]],
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=154800617,
order=1543320621587, nodeOrder=4], committedVers=null, rolledbackVers=null,
cnt=0, super=GridCacheIdMessage [cacheId=0]]]]
This will make it possible to understand which type of activity produces
latency issues and possibly to identify the culprit code.
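The slow-message check above could look roughly like the following. This is a
sketch under assumed names (SlowMessageLogger, a single thresholdMs); the log
line format mirrors the example output:

```java
// Illustrative sketch: after a message is processed, emit a log line if its
// queue wait or processing time exceeded a configured threshold.
final class SlowMessageLogger {
    private final double thresholdMs; // configured slow-message threshold

    SlowMessageLogger(double thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /** Returns the log line for a slow message, or null if it was fast enough. */
    String check(String enqueueTs, double waitMs, double procMs, String msgId,
        int queueSzBefore, String headMsgId, int queueSzAfter, String msg) {
        if (waitMs < thresholdMs && procMs < thresholdMs)
            return null; // fast message, nothing to log

        return String.format(
            "Slow message: enqueueTs=%s, waitTime=%.3f, procTime=%.3f, " +
            "messageId=%s, queueSzBefore=%d, headMessageId=%s, queueSzAfter=%d, " +
            "message=%s",
            enqueueTs, waitMs, procMs, msgId, queueSzBefore, headMsgId,
            queueSzAfter, msg);
    }
}
```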
3. Track all synchronous waits while processing a message: checkpoint
locks, segment locks, page locks, cold page reads from disk, etc.
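Each such wait could be wrapped in a small try-with-resources scope that
attributes the elapsed time to a named wait type. Again a sketch under my own
naming (WaitTracker), not an existing Ignite API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: attribute time spent in a named synchronous wait
// (e.g. "checkpointLock", "pageLock", "coldPageRead") to a shared counter.
final class WaitTracker implements AutoCloseable {
    /** Accumulated wait time per wait type, in nanoseconds. */
    static final Map<String, LongAdder> waitNanos = new ConcurrentHashMap<>();

    private final String waitType;
    private final long start = System.nanoTime();

    WaitTracker(String waitType) {
        this.waitType = waitType;
    }

    @Override public void close() {
        waitNanos.computeIfAbsent(waitType, t -> new LongAdder())
            .add(System.nanoTime() - start);
    }
}

// Intended usage inside message processing:
//     try (WaitTracker ignored = new WaitTracker("checkpointLock")) {
//         // acquire the checkpoint read lock here
//     }
```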
4. Allow extensions that collect additional statistics for a specific
message type. For example, for *TxFinishRequest we could track the total
number of entries written per cache and their size in bytes. For
GridNearLockRequest we could track the keys with the longest lock queues to
identify contention.
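The extension point could be as small as the following interface (name and
shape are my assumption, to illustrate the idea):

```java
// Illustrative sketch of a per-message-type statistics extension point.
// A registry would look up the extension by messageType() and invoke
// collect() after each message of that type is processed.
interface MessageStatsExtension<M> {
    /** Message class this extension handles, e.g. GridNearLockRequest. */
    Class<M> messageType();

    /** Called after the message is processed, to collect extra statistics. */
    void collect(M msg);
}
```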
5. Implement additional logging, JMX metrics and a control.sh command for
printing the currently collected profiling data.
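The JMX side could expose something like the following bean (interface and
method names are hypothetical, just to show the intended surface):

```java
// Illustrative sketch of a JMX bean exposing the collected message statistics.
interface MessageStatsMXBean {
    /** Total number of messages processed since the last reset. */
    long getTotalProcessedMessages();

    /** Per-message-type statistics table, as control.sh would print it. */
    String dumpStatistics();

    /** Drops all collected statistics. */
    void reset();
}
```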
6. The implementation should be as lightweight as possible.
Maybe something else.
BTW, I have already started working on this and want to contribute my work
to the community. See the ticket with a PR (not finished) [1]; check
CacheMessageStatsTest and CacheMessageStatsIndexingTest.
[1] https://issues.apache.org/jira/browse/IGNITE-10418
Thoughts?
--
Best regards,
Alexei Scherbakov