FYI, below are some statistics on how long the collector took to respond to various types of queries, courtesy of the new statistics monitoring added in v8.7.0.
For comparison, I put stats below from both the CERN global CMS pool central manager and UW CHTC's central manager. I got these stats via a command like:

    condor_status -pool central-manager.host.edu -collector -statistics all:2 -l | egrep '(^Handle.*|ForkWorkers)'
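For anyone who wants to post-process this output rather than eyeball it, here is a minimal sketch that turns the "Name = value" lines into a dictionary. It assumes the command's output has already been captured as text; the function name parse_collector_stats and the Machine line in the sample are made up for illustration, and the regular expression simply mirrors the egrep filter in the command above.

```python
import re

# Mirrors the egrep filter: names starting with "Handle", or containing "ForkWorkers".
STAT_FILTER = re.compile(r"^Handle.*|ForkWorkers")


def parse_collector_stats(text):
    """Parse 'Name = value' lines into a dict of floats (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not STAT_FILTER.search(line):
            continue  # not one of the statistics we care about
        name, sep, value = line.partition("=")
        if not sep:
            continue  # no '=' on this line
        try:
            stats[name.strip()] = float(value.strip())
        except ValueError:
            continue  # non-numeric attribute; skip it
    return stats


# Three values copied from the CERN stats in this message, plus one
# illustrative non-statistic line that the filter should drop.
sample = """\
CurrentForkWorkers = 2
HandleQueryMissedFork = 7
HandleQueryMissedForkRuntime = 3807.289392268052
Machine = "central-manager.host.edu"
"""
stats = parse_collector_stats(sample)
```

Counts come back as floats here for simplicity; a real script might want to keep integer counters as integers.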
At first blush, it looks to me like the network between the CERN central manager and the clients doing queries occasionally becomes congested or slow. When this happens, forked query workers take a really long time to do their thing, as evidenced by the massive numbers for the MissedForkRuntime* stats ("MissedFork" = queries that we did in-process because forking would have exceeded the max number of forked workers).
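To put a number on that observation, here is a small sketch that recomputes the average runtimes from the raw HandleQuery counters reported in this message and compares in-process ("missed fork") queries against forked ones. The helper names avg_runtime and missed_fork_slowdown are hypothetical, and the calculation assumes each *RuntimeAvg is simply total runtime divided by the matching count, which the posted numbers appear to bear out.

```python
def avg_runtime(stats, prefix):
    """Total runtime divided by count for one counter family (hypothetical helper)."""
    count = stats[prefix]
    return stats[prefix + "Runtime"] / count if count else 0.0


def missed_fork_slowdown(stats):
    """How many times slower the average in-process query is than the average forked one."""
    forked = avg_runtime(stats, "HandleQueryForked")
    missed = avg_runtime(stats, "HandleQueryMissedFork")
    return missed / forked if forked else float("inf")


# Values copied from the stats in this message.
cern = {
    "HandleQueryForked": 20227,
    "HandleQueryForkedRuntime": 8065.561725556618,
    "HandleQueryMissedFork": 7,
    "HandleQueryMissedForkRuntime": 3807.289392268052,
}
chtc = {
    "HandleQueryForked": 439004,
    "HandleQueryForkedRuntime": 9062.483967624605,
    "HandleQueryMissedFork": 9,
    "HandleQueryMissedForkRuntime": 1.984908144921064,
}

for name, pool in (("CERN", cern), ("CHTC", chtc)):
    print(name, "missed-fork vs forked slowdown:", round(missed_fork_slowdown(pool)))
```

On these numbers the CERN ratio comes out in the thousands while the CHTC ratio is roughly ten, which is the gap the explanation above is pointing at. With only 7 and 9 missed-fork samples, both ratios are rough indications at best.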
regards,
Todd

Stats from CERN central manager (after running about 2 days):

CurrentForkWorkers = 2
HandleLocate = 948
HandleLocateForked = 0
HandleLocateForkedRuntime = 0.0
HandleLocateMissedFork = 0
HandleLocateMissedForkRuntime = 0.0
HandleLocateRuntime = 4.252691565779969
HandleLocateRuntimeAvg = 0.004485961567278448
HandleLocateRuntimeMax = 0.5942266429774463
HandleLocateRuntimeMin = 0.000327280955389142
HandleLocateRuntimeStd = 0.02825324607232301
HandleQuery = 72690
HandleQueryForked = 20227
HandleQueryForkedRuntime = 8065.561725556618
HandleQueryForkedRuntimeAvg = 0.398752248260079
HandleQueryForkedRuntimeMax = 19.9297046919819
HandleQueryForkedRuntimeMin = 0.005120144924148917
HandleQueryForkedRuntimeStd = 0.2165687619633886
HandleQueryMissedFork = 7
HandleQueryMissedForkRuntime = 3807.289392268052
HandleQueryMissedForkRuntimeAvg = 543.8984846097218
HandleQueryMissedForkRuntimeMax = 2124.699060306884
HandleQueryMissedForkRuntimeMin = 0.001575499074533582
HandleQueryMissedForkRuntimeStd = 862.8426533178198
HandleQueryRuntime = 5011.952447317541
HandleQueryRuntimeAvg = 0.06894968286308352
HandleQueryRuntimeMax = 251.0224419250153
HandleQueryRuntimeMin = 0.0001686080358922482
HandleQueryRuntimeStd = 1.689945545346726
PeakForkWorkers = 16

Stats from CHTC central manager (after running about 3 weeks):

CurrentForkWorkers = 0
HandleLocate = 77408
HandleLocateForked = 2615
HandleLocateForkedRuntime = 51.6975245885551
HandleLocateForkedRuntimeAvg = 0.01976960787325243
HandleLocateForkedRuntimeMax = 0.04788178578019142
HandleLocateForkedRuntimeMin = 0.007770448923110962
HandleLocateForkedRuntimeStd = 0.00348926807103674
HandleLocateMissedFork = 0
HandleLocateMissedForkRuntime = 0.0
HandleLocateRuntime = 88.02384298294783
HandleLocateRuntimeAvg = 0.001137141419271236
HandleLocateRuntimeMax = 0.1072812303900719
HandleLocateRuntimeMin = 0.0004654340445995331
HandleLocateRuntimeStd = 0.002678180300024122
HandleQuery = 249819
HandleQueryForked = 439004
HandleQueryForkedRuntime = 9062.483967624605
HandleQueryForkedRuntimeAvg = 0.02064328335874982
HandleQueryForkedRuntimeMax = 0.4229466877877712
HandleQueryForkedRuntimeMin = 0.000716477632522583
HandleQueryForkedRuntimeStd = 0.003701340726264902
HandleQueryMissedFork = 9
HandleQueryMissedForkRuntime = 1.984908144921064
HandleQueryMissedForkRuntimeAvg = 0.2205453494356738
HandleQueryMissedForkRuntimeMax = 0.5626467131078243
HandleQueryMissedForkRuntimeMin = 0.0854371152818203
HandleQueryMissedForkRuntimeStd = 0.1904528704960855
HandleQueryRuntime = 862.5680961571634
HandleQueryRuntimeAvg = 0.003452772191695441
HandleQueryRuntimeMax = 0.5754754431545734
HandleQueryRuntimeMin = 0.0002174414694309235
HandleQueryRuntimeStd = 0.009835888478160599
PeakForkWorkers = 10
