[shardingsphere] branch master updated: Update engine.en.md (#8229)

panjuan Thu, 19 Nov 2020 18:49:51 -0800

This is an automated email from the ASF dual-hosted git repository.

panjuan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/shardingsphere.git



The following commit(s) were added to refs/heads/master by this push:
     new 5cd00fb  Update engine.en.md (#8229)
5cd00fb is described below

commit 5cd00fbbfa6a359a81cb92b3a82521f1516612d9
Author: yang-7777 <[email protected]>
AuthorDate: Fri Nov 20 10:49:27 2020 +0800

    Update engine.en.md (#8229)
---
 docs/blog/content/material/engine.en.md | 125 +++++++++++++++++++++++++++++++-
 1 file changed, 124 insertions(+), 1 deletion(-)

diff --git a/docs/blog/content/material/engine.en.md 
b/docs/blog/content/material/engine.en.md
index 448bbc3..defbd5b 100644
--- a/docs/blog/content/material/engine.en.md
+++ b/docs/blog/content/material/engine.en.md
@@ -4,4 +4,127 @@ weight = 10
 chapter = true
 +++
 
-## TODO
+## How automatic executor of ShardingSphere works
+
+Today「Analysis of Sharding-Sphere Series Articles」is brings you an 
introduction to the Sharding-Sphere automated execution engine module in this 
article. Since the boss prefers serious technical articles, I try my best to 
use a serious and solemn narrative style to share with you the topic of 
"Sharding-Sphere Automated Execution Engine Module".
+
+I just finished writing and sorting out the code of this module, so before I 
forget, I would like to share this with you, hope this article will help you 
all.
+
+「Analysis of Sharding-Sphere series articles」is conducted by the core 
development members of Sharding-Sphere to introduce and analyze the core 
modules of Sharding-Sphere, the cutting-edge technologies used, and valuable 
experience summaries. This series of articles will take you into the core world 
of Sharding-Sphere, gain new knowledge and inspire inspiration. I hope you will 
follow us, communicate and discuss with us, and we will move forward together.
+
+### About the author
+
+Pan Juan, as a senior DBA worked at JD.com, the responsibility is to develop 
the distributed database and middleware, and the automated management platform 
for database clusters. As a PMC of Apache ShardingSphere, I am willing to 
contribute to the OS community and explore the area of distributed databases 
and NewSQL.
+
+### Concept introduction 
+
+Q: What is "automated execution engine"? 
+
+A: The life cycle of a SQL is: initiated from the client, processed by 
Sharding-Sphere, and then executed and digested in the underlying database. In 
Sharding-Sphere, the process is: SQL analysis-->SQL optimization-->SQL 
routing-->SQL rewriting-->SQL execution-->result merging. The automatic 
execution engine is for deal with the SQL execution problem, that is, how to 
control and efficiently transmit the real SQL after the route rewriting to the 
underlying database for execution. Doesn't  [...]
+
+### Demand scenario
+
+Q: Why do we need an automated execution engine?
+
+A: In the concept introduction section, we introduced the protagonist-the 
automated execution engine. It is also mentioned that its automation is to 
balance the creation of database connections and the selection of results merge 
mode. This is the fate of its birth, the choice of history. The following will 
introduce these two issues that need to be balanced:
+
+1. Database connection creation
+
+As a DBA/Java coder, somehow I still consider the problem from the perspective 
of the DBA. For example, from the perspective of resource control, the number 
of connections for business parties to access the database should be limited, 
which can effectively prevent a certain business operation from occupying too 
much resources, thereby exhausting the resources of the database connection, 
and thus affecting other business visit. Especially when there are many 
sharding tables in a database  [...]
+
+2. Results merge mode selection
+
+From the perspective of execution efficiency, maintaining an independent 
database connection for each fragmented query can more effectively use 
multithreading to improve execution efficiency. Opening independent threads for 
each database connection can parallelize the consumption of IO. The independent 
database connection can keep the reference of the query result set and the 
cursor position, and the cursor can be moved when the corresponding data needs 
to be obtained, avoiding the prema [...]
+
+Streaming merge: The method of merging results by moving the result set cursor 
down is called streaming merge. It does not need to load all the result data 
into memory, which can effectively save memory resources and reduce the 
frequency of garbage collection.
+
+Memory merging: Perform data comparison and merging by reading the result set 
loaded in the memory for merging. It needs to load all the result data into 
memory. 
+
+I believe everyone will definitely choose streaming merge to process the 
result set. However, if there is no guarantee that each shard query holds an 
independent database connection, then you need to load the current query result 
set into memory before reusing the database connection and obtaining the query 
result set of the next shard table. Therefore, even if stream merging can be 
used, it has to use memory merging in this scenario. 
+
+On the one hand, it is the control and protection of database connection 
resources. On the other hand, it uses a better merge mode to save memory 
resources. How to handle the relationship between the two is a problem that the 
ShardingSphere execution engine needs to solve. Specifically, if a piece of SQL 
needs to operate 200 tables under a certain database instance after it has been 
sharding by ShardingSphere, should it choose to create 200 connections for 
parallel execution, or choose t [...]
+
+### Evolutionary theory
+
+For the above scenario, ShardingSphere provided a solution before 3.0.0.M4, 
which introduced the concept of Connection Mode and divided it into two modes: 
MEMORY_STRICTLY and CONNECTION_STRICTLY.
+
+- MEMORY_STRICTLY: This mode is used on the premise that the database has no 
limit on the number of connections it can consume in a single operation. If the 
actual execution of SQL needs to do operations on 200 tables in a database 
instance, a new database connection is created for each table and processed 
concurrently in a multi-threaded manner to maximize execution efficiency. 
Streaming is preferred if SQL satisfies the conditions to prevent memory 
overflow or avoid frequent garbage co [...]
+
+- CONNECTION_STRICTLY: This mode is based on the premise that the database 
strictly controls the number of connections it consumes in a single operation. 
If the actual SQL execution requires an operation on 200 tables in a database 
instance, only a unique database connection will be created and its 200 tables 
will be processed serially. If the sharding is on different databases, it is 
still multi-threaded to process the different libraries, but still only one 
unique database connection i [...]
+
+MEMORY_STRICTLY is used for OLAP operations, where the system throughput can 
be increased by relaxing the restrictions on database connections; 
connection-limited mode is used for OLTP operations, where OLTP is often routed 
to a single slice with a slice key, so it is wise to keep database connections 
tightly controlled to ensure that the online system's database resources can be 
used by more applications.
+
+ShardingSphere leaves the decision of which mode to use to the user, and 
provides configuration of the connection mode, allowing the developer to choose 
between MEMORY_STRICTLY or CONNECTION_STRICTLY.
+
+However, leaving the hard decision to the user makes it necessary for the user 
to understand the pros and cons of the two modes and make choices based on the 
needs of the business scenario. This obviously increases the user's cost of 
learning and using, which is not an optimal solution. 
+
+In addition, this processing scheme leaves the switching of the two modes to a 
static initialization configuration, which lacks flexibility. In actual usage 
scenarios, facing different SQL and placeholder parameters, each routing result 
is different. This means that some operations may need to use memory merging, 
and some operations may better choose streaming merging. They should not be 
configured by the user before ShardingSphere is started, but should be based on 
SQL and placeholder p [...]
+
+ShardingSphere always considers problems from the user's point of view and 
keeps optimizing and diligent, must make relevant optimization adjustments, so 
the automated execution engine has evolved.
+
+In order to reduce the using cost and connect dynamic mode, ShardingSphere has 
refined the idea of an automated execution engine and digested the concept of 
the connection mode inside. The user does not need to know what the so-called 
memory limit mode and connection limit mode are, but the execution engine 
automatically selects the optimal execution plan according to the current 
scenario. 
+
+At the same time, the automated execution engine refines the selection of the 
connection mode to every SQL operation. For each SQL request, the automated 
execution engine will perform real-time calculations and trade-offs based on 
its routing results, and autonomously adopt the appropriate connection mode for 
execution to achieve the optimal balance of resource control and efficiency. 
For the automated execution engine, users only need to configure 
maxConnectionSizePerQuery. This paramet [...]
+
+### Implementation analysis 
+
+The execution flow of the entire automated execution engine is shown below：
+
+![](https://shardingsphere.apache.org/blog/img/engine5.png)
+
+After the route rewrite is complete, we get the route result, which is a 
result set that contains mainly information about SQL, SQL's parameter set, 
database, and so on. Its data structure is shown in the following figure.
+
+![](https://shardingsphere.apache.org/blog/img/engine2.jpg)
+
+The execution process of the execution engine is divided into two phases: 
preparation and execution.
+
+*   Preparation phase
+
+As the name implies, this stage is used to prepare the data for execution. It 
is divided into two steps: result set grouping and execution unit creation.
+
+ a. Result set grouping
+ 
+This step is key to implementing the concept of internalized connection 
patterns. The execution engine automatically selects the appropriate connection 
mode based on the maxConnectionSizePerQuery configuration item, combined with 
the current routing results. The specific steps are as follows：
+
+- Group the SQL routing results according to the name of the database.
+- Obtain the set of SQL routing results that each database instance needs to 
perform for each database connection within the allowable limits of 
maxConnectionSizePerQuery by using the formula below and evolve the optimal 
connection pattern for this request.
+
+![](https://shardingsphere.apache.org/blog/img/engine6.png)
+
+Within the limits allowed by maxConnectionSizePerQuery, when the number of 
requests to be executed for a connection is greater than 1, it means that the 
current database connection cannot hold the corresponding data result set and 
must use in-memory merging; conversely, when the number of requests to be 
executed for a connection is equal to 1, it means that the current database 
connection can hold the corresponding data result set and can use streaming 
merging.
+
+The choice of the connection pattern is made for each physical database each 
time. In other words, if routed to more than one database in the same query, 
the connection patterns are not necessarily the same for each database; they 
may be in mixed form.
+
+ b. Execution of unit creation
+
+This step creates units for execution through the route grouping results 
obtained in the previous step. The execution unit is the unit that creates the 
corresponding database connection for each route grouping result.
+
+When the database is limited to the number of connection resources and there 
are a large number of concurrent operations in the online business, deadlocks 
are likely to be sent if the concurrent acquisition of database connections is 
not handled properly. When multiple requests are waiting for each other to 
release database connection resources, there is a starvation wait, resulting in 
cross deadlocks.
+
+For example, suppose that a single query needs to fetch 2 database connections 
on a particular database for routing to 2 sub-table queries of a library. It is 
possible that query A has already obtained 1 database connection on that 
database and is waiting to obtain the other database connection, while query B 
has also obtained 1 database connection on that database and is also waiting 
for the other database connection to be obtained. If the maximum number of 
connections allowed to the da [...]
+
+![](https://shardingsphere.apache.org/blog/img/engine7.png)
+
+In order to avoid deadlock, ShardingSphere synchronizes the database 
connection. When creating the execution unit, it obtains all the database 
connections required by this SQL request in an atomic way, which eliminates the 
possibility of obtaining some resources for each query request. This locking 
method can solve the deadlock problem, but it will bring a certain degree of 
concurrent performance loss. To show that we are different! What's the 
difference?  
+
+In view of this problem, we also carried out the following two aspects of 
Optimization:          
+
+1\.    Avoid locking and only need to obtain one database connection at a 
time. Because only one connection needs to be obtained at a time, there is no 
need to lock two requests waiting for each other. For most OLTP operations, the 
fragmentation key is used to route to the only data node. At this time, there 
is no need to worry about cross deadlock and lock addition, so as to reduce the 
impact on concurrent efficiency. In addition to routing to a single fragment, 
read-write separation al [...]
+
+2\.    Lock linked resources only for memory limited mode. When using the 
connection restriction mode, the database connection resources are released 
after all query result sets are loaded into memory, so it is not necessary to 
consider deadlock waiting and locking processing. 
+           
+*   Implementation phase
+
+This stage is used for real SQL execution. It is divided into two steps: 
grouping execution and merging result set generation.        
+    
+ a. Group execution
+ 
+In this step, the execution units generated in the preparation phase are 
distributed to the underlying concurrent execution engine, and events are sent 
for each key step in the execution process. For example: execution start event, 
execution success event and execution failure event. The execution engine only 
cares about the sending of events, it doesn't care about the subscribers of the 
events. Other ShardingSphere modules, such as distributed transactions and call 
link tracing, subscri [...]
+         
+ b. Merge result set generation  
+ 
+ShardingSphere generates memory merge result set or stream merge result set 
through connection mode acquired in execution preparation phase, and passes it 
to result merging engine for further work. The core difference between memory 
merge result set and stream merge result set is: stream merge result set gets 
data of result set by cursor, while memory merge result set gets data from 
memory. This is also the data base of memory merge and stream merge.    
+
+Through all the above steps, the execution process of the automatic execution 
engine is completed. Its core purpose is to automatically balance the database 
connection creation and result merge mode selection, to achieve a fine-grained 
balance between resource control and execution efficiency of each query, so as 
to reduce the user's learning cost and worry about the change of business 
scenarios. 
+
+The article you reading is the result of my repeated modifications. I cannot 
be rewarded, just hope that you can get something from reading, which is the 
meaning of our coding and writing. In the future, there will be other articles 
in the analysis of ShardingSphere series. Please pay attention~

[shardingsphere] branch master updated: Update engine.en.md (#8229)

Reply via email to