[ 
https://issues.apache.org/jira/browse/DRILL-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155239#comment-15155239
 ] 

Parth Chandra commented on DRILL-4313:
--------------------------------------

Here's what I have seen/found:
Tableau can use a connection pool to parallelize the execution of queries to a 
single data source. Under the covers, Tableau creates a new process for 
every connection to Drill. It then distributes the queries in 
some fashion across the open connections. 

In a test Tableau dashboard, which consisted of 29 queries being sent to Drill, 
Tableau first created a single connection that ran a couple of metadata 
queries. It then created nine more connections (each in a new process), and 
the ten connections executed the remaining queries. 

The problems -
1) Logging is not safe 
The creation of many processes has an interesting side effect. The ODBC driver 
initializes the drill client library logging with a new name every time it is 
loaded and uses a timestamp to create unique names. Since Tableau initializes 
all the connections in its pool at about the same time, most connections end 
up with the same log file name, and only one succeeds in creating the file. 
The other connections then cannot log anything. 
Additionally, logging is not really thread safe in the client. With multiple 
threads, log statements from different threads get intermixed, which makes the 
log less readable.
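
As a minimal sketch of one way to avoid the name collision (not the driver's 
actual logging code), the timestamp can be combined with the process id and a 
per-process sequence number. The function name, the prefix and the file name 
format are hypothetical, the sketch assumes C++11, and the pid is passed in by 
the caller to keep the example portable.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of the Drill client API.
// A timestamp alone collides when several processes start within the same
// clock tick, so the process id and a per-process sequence number are
// appended to the name as well.
std::string makeLogFileName(const std::string& prefix,
                            std::int64_t timestampSecs,
                            long pid) {
    // Shared by all threads of this process; incremented atomically.
    static std::atomic<unsigned> sequence{0};
    const unsigned seq = sequence.fetch_add(1);
    return prefix + "-" + std::to_string(timestampSecs) + "-" +
           std::to_string(pid) + "-" + std::to_string(seq) + ".log";
}
```

Two processes started in the same second then differ in the pid part, and two 
loads within one process differ in the sequence part.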

2) The std::rand function is unreliable
The C library rand() function is close to being removed from the standard 
because it is inherently flawed. (See 
http://cpp.indi.frih.net/blog/2014/12/the-bell-has-tolled-for-rand/ for an 
easy-to-read explanation.)
The alternative is to switch to Boost (or upgrade the build to C++11), both of 
which provide a much better random library. Both provide a random seed method 
that can use device dependent methods to provide a truly random seed, and a 
pseudo random number generator (mt19937) that performs much better.
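
For illustration, the C++11 variant could look roughly like the sketch below. 
The function name pickNodeIndex is hypothetical and this is not the client's 
actual selection code. Note that std::random_device is allowed to fall back to 
a deterministic source on platforms without an entropy device.

```cpp
#include <cstddef>
#include <random>

// Hypothetical helper: picks an index in [0, nodeCount) uniformly.
// Assumes nodeCount > 0.
std::size_t pickNodeIndex(std::size_t nodeCount) {
    // One engine per thread, seeded once from std::random_device.
    thread_local std::mt19937 engine{std::random_device{}()};
    // The distribution avoids the modulo bias of rand() % nodeCount.
    std::uniform_int_distribution<std::size_t> dist(0, nodeCount - 1);
    return dist(engine);
}
```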

3) With logging fixed, and the random number generator updated, Tableau's 
pattern still causes uneven distribution. A situation similar to the one below 
occurred fairly frequently -
(Note a similar unevenness occurred with a 10 node cluster as well)
   Tableau connections - 10
   Cluster size - 3 
   Queries - 29

   Connection   Node   Num queries sent
   1            n1     5
   2            n2     2
   3            n1     4
   4            n3     1
   5            n3     2
   6            n2     4
   7            n1     3
   8            n3     2
   9            n2     3
   10           n1     3

n1 was sent 15 queries and n2 was sent 9, while n3 was sent only 5.
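
The per-node totals are just the column sums of the table above. The short 
sketch below (C++11, hypothetical function name) spells out the arithmetic. An 
even split of 29 queries over 3 nodes would have been 9 or 10 per node.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sums the "Num queries sent" column per node. The (node, count) pairs are
// copied from the table above, in connection order 1..10.
std::map<std::string, int> queriesPerNode() {
    const std::vector<std::pair<std::string, int>> connections = {
        {"n1", 5}, {"n2", 2}, {"n1", 4}, {"n3", 1}, {"n3", 2},
        {"n2", 4}, {"n1", 3}, {"n3", 2}, {"n2", 3}, {"n1", 3}};
    std::map<std::string, int> totals;
    for (const auto& c : connections) {
        totals[c.first] += c.second;
    }
    return totals;  // n1: 15, n2: 9, n3: 5 (29 in total)
}
```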

4) Client side pooling improves this, but the distribution is sometimes still 
a little skewed. The worst I saw -
   n1 - 12 queries
   n2 - 9 queries
   n3 - 8 queries
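
For comparison, the issue description suggests round robin over a pool of 
connections. The sketch below is an idealized illustration of that idea and 
not the implementation discussed in this comment: it assumes a single client 
with exactly one pooled connection per node, which does not hold when several 
Tableau processes each keep their own pool. All names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical round-robin selector over a fixed number of pooled
// connections. Assumes poolSize > 0 and single-threaded use.
class RoundRobinSelector {
public:
    explicit RoundRobinSelector(std::size_t poolSize)
        : poolSize_(poolSize), next_(0) {}

    // Returns the index of the connection to use for the next query.
    std::size_t next() {
        const std::size_t chosen = next_;
        next_ = (next_ + 1) % poolSize_;
        return chosen;
    }

private:
    std::size_t poolSize_;
    std::size_t next_;
};

// Counts how many of queryCount queries each connection receives.
std::vector<int> distribute(std::size_t poolSize, std::size_t queryCount) {
    RoundRobinSelector selector(poolSize);
    std::vector<int> counts(poolSize, 0);
    for (std::size_t i = 0; i < queryCount; ++i) {
        ++counts[selector.next()];
    }
    return counts;
}
```

Under those assumptions, distribute(3, 29) yields 10, 10 and 9 queries, so 
the counts differ by at most one.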

Client side pooling has an additional problem: we cannot maintain session 
settings across the pool without additional work. One option is for the client 
library to record all alter session queries and replay them across all 
connections in the pool (ugly). Another option is to create a session id and 
maintain the id in Zookeeper. As part of the handshake the client would either 
request a new session or ask to join (reuse) an existing session based on the 
session id. (This is not simple and promises to cause grief, IMHO.) This 
option also breaks backward compatibility.
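
The first option (record and replay) could be sketched as below. This only 
illustrates the bookkeeping. The class name is hypothetical, the callback 
stands in for whatever call submits a query on one pooled connection, and the 
sketch ignores later statements that override or reset an earlier one.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: remembers every ALTER SESSION statement the
// application issued so it can be re-run on each pooled connection.
class SessionOptionLog {
public:
    // Called whenever the application submits an ALTER SESSION statement.
    void record(const std::string& alterSessionSql) {
        statements_.push_back(alterSessionSql);
    }

    // Calls execute once per recorded statement, in the original order.
    // execute is assumed to run the statement on one pooled connection.
    void replay(const std::function<void(const std::string&)>& execute) const {
        for (const auto& sql : statements_) {
            execute(sql);
        }
    }

private:
    std::vector<std::string> statements_;
};
```

The pool would call replay once for every connection it opens, and again on 
all existing connections whenever a new statement is recorded.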

I have an implementation of client side connection pooling, with the caveat 
that the user can only use system level options. Since Tableau appears to 
create a connection pool itself, I don't see how Tableau would be using session 
level options anyway. 
I don't think this should be exposed to the end user unless they really want it 
(and it appears that some do), so this would be something that cannot be 
enabled through the ODBC driver but only by some other means, such as an 
environment variable. It would also be off by default.
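
An opt-in switch of this kind could be read roughly as in the sketch below. 
The variable name DRILL_CLIENT_POOL_SIZE_EXAMPLE, the function names and the 
upper bound of 1024 are made up for illustration and are not the names or 
limits used by the client.

```cpp
#include <cstdlib>

// Parses the raw value of the (hypothetical) environment variable.
// Returns 0, meaning pooling stays off, when the value is missing, is not
// a plain decimal number, or lies outside the range 1..1024.
int parsePoolSize(const char* value) {
    if (value == nullptr) {
        return 0;
    }
    char* end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    if (end == value || *end != '\0' || parsed < 1 || parsed > 1024) {
        return 0;
    }
    return static_cast<int>(parsed);
}

// Off by default: an unset variable yields a pool size of 0.
int poolSizeFromEnvironment() {
    return parsePoolSize(std::getenv("DRILL_CLIENT_POOL_SIZE_EXAMPLE"));
}
```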

Thoughts?


> C++ client - Improve method of drillbit selection from cluster
> --------------------------------------------------------------
>
>                 Key: DRILL-4313
>                 URL: https://issues.apache.org/jira/browse/DRILL-4313
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.6.0
>
>
> The current C++ client handles multiple parallel queries over the same 
> connection, but that creates a bottleneck as the queries get sent to the same 
> drillbit.
> The client can manage this more effectively by choosing from a configurable 
> pool of connections and round robin queries to them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
