Hi,

Per request from Todd, here's a log of all the queries I can see in our 
collector.

************************************************************************************************************************************************************
NEGOTIATION

First, the MachinePrivate queries (one per negotiator):

Query info: matched=119041; skipped=0; query_time=2.601966; 
send_time=52.530350; type=MachinePrivate; requirements={true}; locate=0; 
peer=$NEGOTIATOR; projection={}

Then, the public queries (one per negotiator; three different negotiators):

Query info: matched=54959; skipped=72294; query_time=7.576912; 
send_time=52.929494; type=Any; requirements={(((MyType == "Scheduler") || 
(MyType == "Submitter")) || (((MyType == "Machine") && ( 
!regexp("^(T1_|T2_US_)",GLIDEIN_CMSSite) && ((false ? true : (Cpus > 0)) && 
(time() < GLIDEIN_ToRetire))))))}; locate=0; peer=$NEGOTIATOR; projection={}
Query info: matched=28225; skipped=99028; query_time=6.118522; 
send_time=31.376188; type=Any; requirements={(((MyType == "Scheduler") || 
(MyType == "Submitter")) || (((MyType == "Machine") && 
((regexp("^T1_",GLIDEIN_CMSSite) && (((false ? true : (Cpus > 0)) && (time() < 
GLIDEIN_ToRetire)))) || (GLIDEIN_CMSSite =?= "T2_CH_CERN_HLT")))))}; locate=0; 
peer=$NEGOTIATOR; projection={}
Query info: matched=37133; skipped=91718; query_time=7.519020; 
send_time=67.112059; type=Any; requirements={(((MyType == "Scheduler") || 
(MyType == "Submitter")) || (((MyType == "Machine") && 
(regexp("^T2_US_",GLIDEIN_CMSSite) && ((false ? true : (Cpus > 0)) && (time() < 
GLIDEIN_ToRetire))))))}; locate=0; peer=$NEGOTIATOR; projection={}
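(Aside: these "Query info:" lines are regular enough to tally mechanically.  A
minimal sketch of a parser; the regex and field names below are just my reading
of the log format above, not any official schema:)

```python
import re

# Parse the numeric fields out of a CollectorLog "Query info:" line.
# Field layout is inferred from the sample lines above, not from a spec.
QUERY_RE = re.compile(
    r"matched=(?P<matched>\d+); skipped=(?P<skipped>\d+); "
    r"query_time=(?P<query_time>[\d.]+); send_time=(?P<send_time>[\d.]+); "
    r"type=(?P<type>\w+)"
)

def parse_query_info(line):
    """Return a dict of the numeric fields, or None if the line doesn't match."""
    m = QUERY_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    for k in ("matched", "skipped"):
        d[k] = int(d[k])
    for k in ("query_time", "send_time"):
        d[k] = float(d[k])
    return d

line = ("Query info: matched=119041; skipped=0; query_time=2.601966; "
        "send_time=52.530350; type=MachinePrivate; requirements={true}; "
        "locate=0; peer=$NEGOTIATOR; projection={}")
info = parse_query_info(line)
print(info["matched"], info["type"])  # 119041 MachinePrivate
```

Summing query_time/send_time per peer with something like this is how I'd rank
who is actually costing us collector time.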

************************************************************************************************************************************************************
GLIDEINWMS

A query per group (we have 6 groups):

 Query info: matched=1305; skipped=117739; query_time=7.906368; 
send_time=0.216429; type=Machine; requirements={((((true) && (IS_MONITOR_VM =!= 
true) && (GLIDEIN_Factory =!= undefined) && (GLIDEIN_Name =!= undefined) && 
(GLIDEIN_Entry_Name =!= undefined)) && ((GLIDECLIENT_Name =?= 
"CMSG-v1_0.opportunistic"))))}; locate=0; peer=$GLIDEINWMS_FRONTEND; 
projection={GLIDEIN_CredentialIdentifier TotalSlots Cpus Memory 
PartitionableSlot SlotType TotalSlotCpus State Activity EnteredCurrentState 
EnteredCurrentActivity LastHeardFrom GLIDEIN_Factory GLIDEIN_Name 
GLIDEIN_Entry_Name GLIDECLIENT_Name GLIDECLIENT_ReqNode GLIDEIN_Schedd Name}

A query for the whole frontend:
Query info: matched=104226; skipped=14816; query_time=6.779416; 
send_time=9.465658; type=Machine; requirements={((((true) && (IS_MONITOR_VM =!= 
true) && (GLIDEIN_Factory =!= undefined) && (GLIDEIN_Name =!= undefined) && 
(GLIDEIN_Entry_Name =!= undefined)) && ((substr(GLIDECLIENT_Name,0,10) =?= 
"CMSG-v1_0."))))}; locate=0; peer=$GLIDEINWMS_FRONTEND; projection={State 
Activity PartitionableSlot TotalSlots Cpus Memory Name}

A query for the whole pool:
Query info: matched=119070; skipped=0; query_time=2.002817; 
send_time=13.366689; type=Machine; requirements={((((true)) && (true)))}; 
locate=0; peer=$GLIDEINWMS_FRONTEND; projection={State Activity 
PartitionableSlot TotalSlots Cpus Memory Name}

For the last two queries above, I cannot tell if they are repeated once per 
group or once globally.  I suspect once per group, given the multiprocess 
architecture of gWMS.
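Back-of-the-envelope on why this hurts: every query scans matched + skipped
ads regardless of how few it returns, and the per-group match rate is tiny.  A
quick sketch using the numbers above (the per-group repetition of the last two
queries is my suspicion, not confirmed):

```python
# Rough cost estimate from the log numbers above: the collector walks
# matched + skipped ads for every query, however small the result set.
group_query = {"matched": 1305, "skipped": 117739}      # one per-group query
frontend_query = {"matched": 104226, "skipped": 14816}  # whole-frontend query
pool_query = {"matched": 119070, "skipped": 0}          # whole-pool query

def scanned(q):
    return q["matched"] + q["skipped"]

# Per-group match rate: only about 1% of scanned ads are actually wanted.
match_rate = group_query["matched"] / scanned(group_query)
print(f"group query match rate: {match_rate:.1%}")

# If the frontend/pool queries really do repeat once per group (6 groups),
# the ads scanned per frontend cycle balloon accordingly.
GROUPS = 6
total = GROUPS * (scanned(group_query) + scanned(frontend_query)
                  + scanned(pool_query))
print(f"ads scanned per cycle if repeated per group: {total}")
```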

Schedd load information (could go to schedd-only collector with gWMS code 
changes?):

Query info: matched=43; skipped=0; query_time=0.000143; send_time=0.043684; 
type=Scheduler; requirements={((true))}; locate=0; peer=$GLIDEINWMS_FRONTEND; 
projection={TotalRunningJobs TotalSchedulerJobsRunning 
TransferQueueNumUploading MaxJobsRunning TransferQueueMaxUploading 
CurbMatchmaking Name}

Schedd full info (looks redundant?):
Query info: matched=1; skipped=42; query_time=0.000377; send_time=0.010730; 
type=Scheduler; requirements={((stricmp(Name,"$SCHEDD") == 0))}; locate=0; 
peer=$GLIDEINWMS_FRONTEND; projection={}
Query info: matched=0; skipped=43; query_time=0.000164; send_time=0.000148; 
type=Scheduler; requirements={((Name =?= "$SCHEDD"))}; locate=0; 
peer=$GLIDEINWMS_FRONTEND; projection={ScheddIpAddr SPOOL_DIR_STRING 
LOCAL_DIR_STRING Name}

****************************************************************************************************************************************************************
CRAB3 location finding:

One per schedd:
Query info: matched=1; skipped=42; query_time=0.001595; send_time=0.000409; 
type=Scheduler; requirements={((Name =?= "$SCHEDD"))}; locate=0; 
peer=$CRAB_SERVER; projection={AddressV1 CondorPlatform CondorVersion Machine 
MyAddress Name MyType ScheddIpAddr RemoteCondorSetup}


****************************************************************************************************************************************************************
MONITORING

These *could* point at the non-active top-level collector.

Various cluster monitoring:
Query info: matched=139912; skipped=10279; query_time=7.712620; 
send_time=22.114816; type=Any; requirements={((GLIDEIN_CMSSite =!= undefined && 
SlotType =!= undefined))}; locate=0; peer=$MONITORING; projection={Name 
PartitionableSlot GLIDEIN_CMSSite TotalSlotCpus Cpus SlotType State Memory 
TotalMemory ChildMemory ChildCpus ChildAccountingGroup AccountingGroup 
GLIDEIN_ToRetire GLIDEIN_ToDie}
Query info: matched=112277; skipped=31499; query_time=4.845609; 
send_time=6.553424; 
type=Any; requirements={((State =?= "Claimed"))}; locate=0; peer=$MONITORING; 
projection={Name}
Query info: matched=3; skipped=0; query_time=0.000051; send_time=0.000190; 
type=Negotiator; requirements={((true))}; locate=0; peer=$MONITORING; 
projection={Name LastNegotiationCycleDuration0}

This *could* point at the schedd-only collector.  It also runs a very old 
version of the python bindings whose .locateAll does not support projections:
Query info: matched=43; skipped=0; query_time=0.000129; send_time=1.846534; 
type=Scheduler; requirements={true}; locate=0; peer=$SCHEDD; projection={}
https://github.com/CMSCompOps/WmAgentScripts/blob/master/CondorMonitoring/JobCounter.py#L552

****************************************************************************************************************************************************************
STUPID CRAP:

From a monitoring script, one per glexec failure on the entire grid:

Query info: matched=1; skipped=146622; query_time=2.896500; send_time=0.000681; 
type=Machine; requirements={((Name == 
"slot1@glidein_29695_236793050@$WORKERNODE"))}; locate=0; peer=$SCHEDD; 
projection={glidein_cmssite}
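If the script can't be pointed elsewhere, batching would at least amortize the
scan: one constraint covering many slot names instead of one full-pool scan per
failure.  A hypothetical helper (the function and the slot names below are my
illustration, not what the script currently does):

```python
def batched_name_constraint(slot_names):
    """Build a single ClassAd constraint matching any of the given slot names,
    so one collector query replaces one full-pool scan per glexec failure."""
    clauses = ['(Name == "%s")' % name for name in slot_names]
    return " || ".join(clauses)

# Hypothetical worker-node names, just for illustration.
constraint = batched_name_constraint([
    "slot1@glidein_29695_236793050@node1.example.com",
    "slot1@glidein_29696_236793051@node2.example.com",
])
print(constraint)
```

A stringListMember(Name, "...") constraint would be another way to express the
same batch, if the list gets long.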


****************************************************************************************************************************************************************
UNKNOWN BUT NOTABLE:

There are a few heavy-duty queries that happen relatively infrequently (once 
every 15 minutes?) that I know exist but haven't found in the logs.  We tend to 
point these at the non-active top-level collector, which is maybe why I don't 
see them?

So, prioritization:
0) Fix stupid crap.
1) Fix monitoring query that is blocking the top-level collector.
2) Move schedd-only queries to the schedd-only collector.
3) Point remaining monitoring at non-active collector.
*** At this point, only negotiator and gWMS remain ***
4) Ask gWMS devs to remove duplicate queries of pool size.

Even then, the only load remaining would be negotiation and glideinWMS, which 
I suspect is still too much.

Brian

PS - at time of writing, the top-level collector has been dropping packets 
continuously.  Over a span of 5 minutes, I haven't seen it *not* be at the max 
of 256MB backlog on the UDP socket.
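For spot-checking that backlog on Linux, the rx_queue column of /proc/net/udp
(hex, second half of the tx_queue:rx_queue pair) is one place to look.  A
sketch, with a fabricated sample line for illustration (9618 is the default
HTCondor collector port):

```python
def udp_rx_backlog(proc_net_udp_text, local_port):
    """Sum rx_queue bytes for UDP sockets bound to local_port, parsed from
    /proc/net/udp text.  Columns: sl, local_address, rem_address, st,
    tx_queue:rx_queue, ...; addresses and queue sizes are hex."""
    total = 0
    for line in proc_net_udp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        port = int(fields[1].split(":")[1], 16)
        if port != local_port:
            continue
        rx_queue = int(fields[4].split(":")[1], 16)
        total += rx_queue
    return total

# Fabricated sample: one socket on port 9618 with 0x10000000 (256MB) queued.
sample = (
    "  sl  local_address rem_address   st tx_queue rx_queue tr uid\n"
    "  42: 00000000:2592 00000000:0000 07 00000000:10000000 00 0\n"
)
print(udp_rx_backlog(sample, 9618))
```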

_______________________________________________
HTCondor-devel mailing list
HTCondor-devel@cs.wisc.edu
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel
