This is an automated email from the ASF dual-hosted git repository.
hanahmily pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git
The following commit(s) were added to refs/heads/main by this push:
new 91f0b712 Refactor distributed query (#321)
91f0b712 is described below
commit 91f0b712a95cecc3df002b48ec238a1f1a065854
Author: Gao Hongtao <[email protected]>
AuthorDate: Thu Aug 31 15:41:39 2023 +0800
Refactor distributed query (#321)
* Refactor distributed query
Signed-off-by: Gao Hongtao <[email protected]>
---
docs/concept/clustering.md | 75 ++++++++++++++++++----------------------------
1 file changed, 29 insertions(+), 46 deletions(-)
diff --git a/docs/concept/clustering.md b/docs/concept/clustering.md
index 0f8b7246..46141d73 100644
--- a/docs/concept/clustering.md
+++ b/docs/concept/clustering.md
@@ -1,17 +1,17 @@
# BanyanDB Clustering
-BanyanDB Clustering introduces a robust and scalable architecture that
comprises "Query Nodes", "Liaison Nodes", "Data Nodes", and "Meta Nodes". This
structure allows for effectively distributing and managing time-series data
within the system.
+BanyanDB Clustering introduces a robust and scalable architecture that
comprises "Liaison Nodes", "Data Nodes", and "Meta Nodes". This structure
allows for effectively distributing and managing time-series data within the
system.
## 1. Architectural Overview
-A BanyanDB installation includes four distinct types of nodes: Data Nodes,
Meta Nodes, Query Nodes, and Liaison Nodes.
-
-
+A BanyanDB installation includes three distinct types of nodes: Data Nodes,
Meta Nodes, and Liaison Nodes.
### 1.1 Data Nodes
Data Nodes hold all the raw time series data, metadata, and indexed data. They
handle the storage and management of data, including streams and measures, tag
keys and values, as well as field keys and values.
+Data Nodes also handle the local query execution. When a query is made, it is
directed to a Liaison, which then interacts with Data Nodes to execute the
distributed query and return results.
+
In addition to persistent raw data, Data Nodes also handle TopN aggregation
calculation or other computational tasks.
### 1.2 Meta Nodes
@@ -21,27 +21,17 @@ Meta Nodes is implemented by etcd. They are responsible for
maintaining high-lev
- All nodes in the cluster
- All database schemas
-### 1.3 Query Nodes
+### 1.3 Liaison Nodes
-Query Nodes are responsible for handling computational tasks associated with
querying the database. They build query tasks and search for data from Data
Nodes.
+Liaison Nodes serve as gateways, routing traffic to Data Nodes. In addition to
routing, they also provide authentication, TTL, and other security services to
ensure secure and effective communication without the cluster.
-### 1.4 Liaison Nodes
+Liaison Nodes are also responsible for handling computational tasks associated
with distributed querying the database. They build query tasks and search for
data from Data Nodes.
-Liaison Nodes serve as gateways, routing traffic to Query Nodes and Data
Nodes. In addition to routing, they also provide authentication, TTL, and other
security services to ensure secure and effective communication without the
cluster.
-
-### 1.5 Standalone Mode
+### 1.4 Standalone Mode
BanyanDB integrates multiple roles into a single process in the standalone
mode, making it simpler and faster to deploy. This mode is especially useful
for scenarios with a limited number of data points or for testing and
development purposes.
-In this mode, the single process performs the roles of the Liaison Node, Query
Node, Data Node, and Meta Node. It receives requests, maintains metadata,
processes queries, and handles data, all within a unified setup.
-
-### 1.6 Mix Mode in Storage Nodes
-
-Query nodes and data nodes are implemented by a same executable binary,
Storage Node. With the flag "mode", the storage node can be started as a query
node or a data node. The default mode is "mix", which means the storage node is
both a query node and a data node.
-
-If the workload of query is high, you can start more storage nodes with the
flag "mode" set to "query". If the workload of write is high, you can start
more storage nodes with the flag "mode" set to "data".
-
-Or you can start storage nodes with the flag "mode" set to "mix" to balance
the workload of query and write.
+In this mode, the single process performs the roles of the Liaison Node, Data
Node, and Meta Node. It receives requests, maintains metadata, processes
queries, and handles data, all within a unified setup.
## 2. Communication within a Cluster
@@ -49,12 +39,11 @@ All nodes within a BanyanDB cluster communicate with other
nodes according to th
- Meta Nodes share high-level metadata about the cluster.
- Data Nodes store and manage the raw time series data and communicate with
Meta Nodes.
-- Query Nodes interact with Data Nodes to execute queries and return results
to the Liaison Nodes.
-- Liaison Nodes distribute incoming requests to the appropriate Query Nodes or
Data Nodes.
+- Liaison Nodes distribute incoming data to the appropriate Data Nodes. They
also handle distributed query execution and communicate with Meta Nodes.
### Nodes Discovery
-All nodes in the cluster are discovered by the Meta Nodes. When a node starts
up, it registers itself with the Meta Nodes. The Meta Nodes then share this
information with the Liaison Nodes and Query Nodes, which use it to route
requests to the appropriate nodes.
+All nodes in the cluster are discovered by the Meta Nodes. When a node starts
up, it registers itself with the Meta Nodes. The Meta Nodes then share this
information with the Liaison Nodes which use it to route requests to the
appropriate nodes.
## 3. **Data Organization**
@@ -70,17 +59,19 @@ By storing shard allocation information, Meta Nodes help
ensure that data is rou
Data Nodes store all raw time series data, metadata, and indexed data. On
disk, the data is organized by `<group>/shard-<shard_id>/<segment_id>/`. The
segment is designed to support retention policy.
-### 3.3 Query Nodes
-
-Query Nodes do not store data. They handle the computational tasks associated
with data queries, interacting directly with Data Nodes to execute queries and
return results.
-
-### 3.4 Liaison Nodes
+### 3.3 Liaison Nodes
Liaison Nodes do not store data but manage the routing of incoming requests to
the appropriate Query or Data Nodes. They also provide authentication, TTL, and
other security services.
+They also handle the computational tasks associated with data queries,
interacting directly with Data Nodes to execute queries and return results.
+
## 4. **Determining Optimal Node Counts**
-When creating a BanyanDB cluster, choosing the appropriate number of each node
type to configure and connect is crucial. The number of Meta Nodes should
always be odd, for instance, “3”. The number of Data Nodes scales based on your
storage and query needs. The number of Query Nodes and Liaison Nodes depends on
the expected query load and routing complexity.
+When creating a BanyanDB cluster, choosing the appropriate number of each node
type to configure and connect is crucial. The number of Meta Nodes should
always be odd, for instance, “3”. The number of Data Nodes scales based on your
storage and query needs. The number of Liaison Nodes depends on the expected
query load and routing complexity.
+
+If the write and read load is from different sources, it is recommended to
separate the Liaison Nodes for write and read. For instance, if the write load
is from metrics, trace or log collectors and the read load is from a web
application, it is recommended to separate the Liaison Nodes for write and read.
+
+This separation allows for more efficient routing of requests and better
performance. It also allows for scaling out of the cluster based on the
specific needs of each type of request. For instance, if the write load is
high, you can scale out the write Liaison Nodes to handle the increased load.
The BanyanDB architecture allows for efficient clustering, scaling, and high
availability, making it a robust choice for time series data management.
@@ -142,15 +133,15 @@ This architecture allows BanyanDB to execute write
requests efficiently across a
## 6. Queries in a Cluster
-BanyanDB utilizes a distributed architecture that allows for efficient query
processing. When a query is made, it is directed to a Query Node.
+BanyanDB utilizes a distributed architecture that allows for efficient query
processing. When a query is made, it is directed to a Liaison Node.
### 6.1 Query Routing
-Query Nodes differ from Liaison Nodes in that they do not store shard mapping
information from Meta Nodes. Instead, they access all Data Nodes to retrieve
the necessary data for queries. As the query load is lower, it is practical for
query nodes to access all data nodes for this purpose. It may increase network
traffic, but simplifies scaling out of the cluster.
+Liaison Nodes do not use shard mapping information from Meta Nodes to execute
distributed queries. Instead, they access all Data Nodes to retrieve the
necessary data for queries. As the query load is lower, it is practical for
liaison nodes to access all data nodes for this purpose. It may increase
network traffic, but simplifies scaling out of the cluster.
Compared to the write load, the query load is relatively low. For instance, in
a time series database, the write load is typically 100x higher than the query
load. This is because the write load is driven by the number of devices sending
data to the database, while the query load is driven by the number of users
accessing the data.
-This strategy enables scaling out of the cluster. When the cluster scales out,
the query node can access all data nodes without any mapping info changes. It
eliminates the need to backup previous shard mapping information, reducing
complexity of scaling out.
+This strategy enables scaling out of the cluster. When the cluster scales out,
the liaison node can access all data nodes without any mapping info changes. It
eliminates the need to backup previous shard mapping information, reducing
complexity of scaling out.
### 6.2 Query Execution
@@ -167,16 +158,9 @@ User
|
v
------------------------------------
-| Liaison Node | <--- Routes the User's Request
-| (Routes the request to the Query Node)|
-------------------------------------
- |
- | API Request (Query)
- |
- v
-------------------------------------
-| Query Node | <--- Stateless Node
-| (Identify relevant Data Nodes) |
+| Liaison Node | <--- Stateless Node, Distributes Query
+| (Access all Data nodes to |
+| execute distributed queries) |
------------------------------------
| | |
v v v
@@ -188,9 +172,8 @@ User
```
1. A user makes an API request to the Liaison Node. This request may be a
query for specific data.
-2. The Liaison Node routes the request to the appropriate Query Node.
-3. The Query Node, which is stateless, select all data nodes.
-4. The query is executed in parallel across all Data Nodes. Each Data Node
processes the data stored in its shard concurrently with the others.
-5. The results from each shard are then returned to the Query Node, which
consolidates them into a single response to the user.
+2. The Liaison Node builds a distributed query to select all data nodes.
+3. The query is executed in parallel across all Data Nodes. Each Data Node
execute a local query plan to process the data stored in its shard concurrently
with the others.
+4. The results from each shard are then returned to the Liaison Node, which
consolidates them into a single response to the user.
-This architecture allows BanyanDB to execute queries efficiently across a
distributed system, leveraging the routing capabilities of the Liaison Node,
the stateless nature of Query Nodes, and the parallel processing of Data Nodes.
+This architecture allows BanyanDB to execute queries efficiently across a
distributed system, leveraging the distributed query capabilities of the
Liaison Node and the parallel processing of Data Nodes.