[
https://issues.apache.org/jira/browse/DRILL-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155239#comment-15155239
]
Parth Chandra commented on DRILL-4313:
--------------------------------------
Here's what I have seen/found:
Tableau can use a connection pool to parallelize the execution of queries to a
single data source. Under the covers, Tableau creates a new process for
every connection to Drill. It then distributes the queries in
some fashion across the opened connections.
In a test Tableau dashboard, which consisted of 29 queries being sent to Drill,
the pattern I saw was that Tableau would create a single connection that ran a
couple of metadata queries and then created nine more connections (each in a
new process) and the ten connections executed the remaining queries.
The problems -
1) Logging is not safe
The creation of many processes has an interesting side effect. The ODBC driver
initializes the Drill client library logging with a new file name every time it
is loaded, and uses a timestamp to make the name unique. Since the connections
in the Tableau pool are initialized at the same time, most of them end up with
the same file name, and only one succeeds. The other connections then cannot
log anything.
Additionally, logging is not really thread safe in the client. Multiple threads
tend to make the log less readable as log statements from two threads get
intermixed.
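To illustrate the file name collision, here is a minimal sketch of one way to keep names unique across processes that start in the same second: include the process id alongside the timestamp. The helper name makeLogFileName, the "drillclient" prefix, and the name layout are hypothetical and are not taken from the Drill client or the ODBC driver.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper, not part of the Drill client API.
// Builds a log file name from a prefix, a timestamp, and a process id, so
// that two processes started within the same second get different names.
std::string makeLogFileName(const std::string& prefix,
                            long long epochSeconds,
                            long processId) {
    std::ostringstream name;
    name << prefix << '-' << epochSeconds << '-' << processId << ".log";
    return name.str();
}
```

A caller would pass the current time and its own process id (for example from getpid() on POSIX systems). This addresses only the naming collision, not the interleaving of log statements from multiple threads.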
2) The std::rand function is unreliable
The C library rand() function is close to being removed from the standard
because it is inherently flawed. (See
http://cpp.indi.frih.net/blog/2014/12/the-bell-has-tolled-for-rand/ for an
easy-to-read explanation.)
The alternative is to switch to Boost (or upgrade the build to C++11), both of
which provide a much better random library. Each provides a seed source
(random_device) that can use device-dependent methods to produce a
non-deterministic seed, and a pseudo-random number generator (mt19937) that
performs much better.
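As a rough sketch of the C++11 variant, the following seeds std::mt19937 from std::random_device and draws a uniformly distributed index into the list of drillbits. The function names pickDrillbitIndex and makeSeededEngine are hypothetical; this is not the code in the Drill client.

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>

// Hypothetical helper: create a Mersenne Twister engine seeded from the
// platform's random device instead of from the current time.
std::mt19937 makeSeededEngine() {
    std::random_device device;
    return std::mt19937(device());
}

// Hypothetical helper: pick an index in [0, drillbitCount - 1] with equal
// probability for each drillbit.
std::size_t pickDrillbitIndex(std::size_t drillbitCount, std::mt19937& engine) {
    if (drillbitCount == 0) {
        throw std::invalid_argument("drillbitCount must be greater than zero");
    }
    std::uniform_int_distribution<std::size_t> dist(0, drillbitCount - 1);
    return dist(engine);
}
```

Using uniform_int_distribution instead of `rand() % n` avoids modulo bias. Seeding from random_device means processes that start at the same instant do not share a seed, which would happen with a time-based seed.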
3) With logging fixed, and the random number generator updated, Tableau's
pattern still causes uneven distribution. A situation similar to the one below
occurred fairly frequently -
(Note: a similar unevenness occurred with a 10-node cluster as well.)
Tableau connections - 10
Cluster size - 3
Queries - 29
Connection  Node  Num queries sent
1           n1    5
2           n2    2
3           n1    4
4           n3    1
5           n3    2
6           n2    4
7           n1    3
8           n3    2
9           n2    3
10          n1    3
n1 has 15 queries, while n3 has only 5 queries sent to it.
4) Client side pooling improves this but is sometimes still a little askew. The
worst I saw -
n1 - 12 queries
n2 - 9 queries
n3 - 8 queries
Client side pooling has an additional problem: we cannot maintain session
settings across the pool without additional work. One option is for the client
library to record all ALTER SESSION queries and replay them across all
connections in the pool (ugly). Another option is to create a session id and
maintain the id in Zookeeper. As part of the handshake, the client would either
request a new session or ask to join (reuse) an existing session based on the
session id. (This is not simple and promises to cause grief, IMHO.) This
option also breaks backward compatibility.
I have the implementation for client side connection pooling with the caveat
that the user can only use system level options. Since Tableau appears to
create a connection pool itself, I don't see how Tableau would be using session
level options anyway.
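For readers unfamiliar with the idea, round robin selection over a pool can be sketched as below. This is not the implementation mentioned above, only a minimal illustration; the class name RoundRobinSelector is hypothetical, and the real pool would hold connections rather than bare indices.

```cpp
#include <atomic>
#include <cstddef>
#include <stdexcept>

// Hypothetical sketch: hands out pool indices 0, 1, ..., poolSize - 1 in
// rotation. The atomic counter lets multiple threads call next() safely.
class RoundRobinSelector {
public:
    explicit RoundRobinSelector(std::size_t poolSize)
        : poolSize_(poolSize), counter_(0) {
        if (poolSize == 0) {
            throw std::invalid_argument("pool must not be empty");
        }
    }

    // Returns the index of the pooled connection that should take the next query.
    std::size_t next() {
        return counter_.fetch_add(1) % poolSize_;
    }

private:
    std::size_t poolSize_;
    std::atomic<std::size_t> counter_;
};
```

Within a single process, 29 queries over a pool of 3 would be split 10/10/9. Since Tableau starts a separate process per connection, each process would rotate through its own pool independently, which may explain why the observed totals can still be slightly uneven.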
I don't think this should be exposed to the end user unless they really want it
(and it appears that some do), so it would not be enabled through the ODBC
driver but by some other means, such as an environment variable. It would also
be off by default.
Thoughts?
> C++ client - Improve method of drillbit selection from cluster
> --------------------------------------------------------------
>
> Key: DRILL-4313
> URL: https://issues.apache.org/jira/browse/DRILL-4313
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Fix For: 1.6.0
>
>
> The current C++ client handles multiple parallel queries over the same
> connection, but that creates a bottleneck as the queries get sent to the same
> drillbit.
> The client can manage this more effectively by choosing from a configurable
> pool of connections and round robin queries to them.