@jihoonson 
My test environment had 3 brokers, 2 coordinators, 2 overlords, ~40 
MiddleManagers (each running about 6 Kafka indexing tasks created by the Kafka 
supervisor), and about 15-20 Historicals.

Some background and information is noted in 
https://groups.google.com/forum/#!msg/druid-development/eIWDPfhpM_U/AzMRxSQGAgAJ
but there are 3 completely _independent_ things here:
1) switching coordinator to use HTTP (using `HttpLoadQueuePeon`) for segment 
assignment (load/drop)
2) switching broker/coordinator to use HTTP (using `HttpServerInventoryView`) 
for discovering what segments are served by queryable nodes (historicals, and 
peons doing indexing)
3) switching overlord to use HTTP for task management (using `HttpRemoteTaskRunner`)

In my comment above I was talking about trying to make (1) and (2) the default 
after a bit of testing on some more clusters that you have.

Looks like #6201 pertains to (3), so let us not consider enabling (3) by 
default until we get to the bottom of #6201.

However, once (1), (2) and (3) are done and Druid clusters are using HTTP, and 
once we remove the coordinator/overlord service announcement that is always 
done in ZK to support Tranquility, it technically becomes possible to write 
discovery extensions that don't necessarily use ZooKeeper and use, say, etcd 
instead. This is also an independent activity which will take its own time, so 
I don't want to make it a prerequisite for trying out HTTP, or for defaulting 
to it as we gain more confidence in those features. ZooKeeper code that is no 
longer needed can then be removed in phases (i.e. say 4-6 months after a 
release where a specific thing was made the default).

Each of (1) and (2) leads to one additional connection per broker/coordinator 
to each queryable node.
(3) leads to one additional connection per overlord to each MiddleManager node.
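
As a rough sanity check, the extra connections can be estimated from the node 
counts of the test environment above. This is only a back-of-the-envelope 
sketch: the function and its parameter names are made up for illustration, it 
assumes (1) applies to coordinator -> Historical connections only, and it 
counts every coordinator and overlord (not just the leader), so it is an upper 
bound rather than a measured number.

```python
def extra_http_connections(brokers, coordinators, overlords,
                           middle_managers, historicals, peons):
    """Rough upper bound on connections added by features (1), (2), (3)."""
    # (1) HttpLoadQueuePeon: assumed here to be coordinator -> Historical only.
    load_queue = coordinators * historicals
    # (2) HttpServerInventoryView: each broker/coordinator -> each queryable
    #     node (Historicals and indexing peons).
    inventory = (brokers + coordinators) * (historicals + peons)
    # (3) HttpRemoteTaskRunner: each overlord -> each MiddleManager.
    task_runner = overlords * middle_managers
    return {
        "load_queue": load_queue,
        "inventory": inventory,
        "task_runner": task_runner,
        "total": load_queue + inventory + task_runner,
    }


# Test environment from this comment: ~40 MiddleManagers x ~6 tasks each,
# and taking the upper end (20) of the 15-20 Historicals.
estimate = extra_http_connections(
    brokers=3, coordinators=2, overlords=2,
    middle_managers=40, historicals=20, peons=40 * 6,
)
print(estimate)
```

Under these assumptions, almost all of the additional connections come from 
(2), because it is the only one that touches every peon.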

On the broker/coordinator/overlord side, the `EscalatedGlobal` httpClient is 
used for making requests, so connections from its existing pools are used and 
no new connection pools are created.

> One thing I'm concerned is the increasing HTTP connections.

Theoretically, it should be OK, and in the testing above I haven't seen any 
connection issues popping up due to these features so far. But the concern is 
valid, and we can be more confident only as we roll it out on more clusters.

> On the other day, I could see Kafka indexing service was using too many HTTP 
> connections compared to the number of worker threads even though the cluster 
> was not using HTTP-based orverlords or coordinators. The number of HTTP 
> connections was a few thousand which is not so high, but I'm not sure what is 
> the proper default configuration for the number of worker threads.

I am assuming you meant that the overlord's HTTP client [worker threads] had 
thousands of open outbound connections.

For the `EscalatedGlobal` client, which is used by KIS as well, the number of 
connections is set at 
https://github.com/apache/incubator-druid/blob/master/server/src/main/java/io/druid/guice/http/HttpClientModule.java#L140
(default value is 20).
So, at the overlord, the maximum possible number of connections from that 
httpClient = 20 (or whatever is configured) X (number of KIS task peons, plus 
any other processes that the overlord could talk to over HTTP using this 
client).

From 
https://github.com/apache/incubator-druid/blob/master/server/src/main/java/io/druid/initialization/Initialization.java#L377
I see there are at least 3 other HttpClient instances created with their own 
connection pools, so check whether those are using the connections too.

If the above accounts for the thousands of connections, then it is explained; 
otherwise there is some bug in the `HttpClient` code and it creates more 
connections than it is told to.
It would be good if you could take a look at which machines those connections 
are going to and see whether the connection counts make sense given the 
expectations above.
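
To make the expectation concrete, here is the same bound as a tiny calculation. 
It is a sketch only: the function name is invented for illustration, the peon 
count is taken from my test environment (not from your cluster), and the 20 is 
the default mentioned above.

```python
def max_pooled_connections(connections_per_destination, destinations):
    """Upper bound on open connections for one pooled HTTP client."""
    return connections_per_destination * destinations


# ~40 MiddleManagers, each running about 6 Kafka indexing tasks.
kis_task_peons = 40 * 6
print(max_pooled_connections(20, kis_task_peons))  # 4800
```

So a few thousand connections from the overlord would still be within what the 
pool settings allow for a cluster of that size; the useful check is whether the 
destinations of the observed connections match this picture.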

That said, the features in (1), (2), (3) don't necessarily worsen the 
situation, because we already have far more HTTP requests going on all around 
due to other features. I may be proven wrong in the end, but we wouldn't know 
till we try :)




[ Full content available at: 
https://github.com/apache/incubator-druid/issues/6176 ]