+1

On Dec 1, 2023 at 12:28:23, Murtadha Al-Hubail <[email protected]> wrote:
> Each AsterixDB cluster today consists of one or more Node Controllers (NCs)
> where the data is stored and processed. Each NC has a predefined set of
> storage partitions (iodevices). When data is ingested into the system, the
> data is hash-partitioned across the total number of storage partitions in
> the cluster. Similarly, when the data is queried, each NC will start as
> many threads as the number of storage partitions it has to read and process
> the data in parallel. While this shared-nothing architecture has its
> advantages, it has its drawbacks too. One major drawback is the time needed
> to scale the cluster. Adding a new NC to an existing cluster of (n) nodes
> means writing a completely new copy of the data, which will now be
> hash-partitioned across the new total number of storage partitions of
> (n + 1) nodes. This operation could potentially take several hours or even
> days, which is unacceptable in the cloud age.
>
> This APE is about adding a new deployment (cloud) mode to AsterixDB by
> implementing compute-storage separation to take advantage of the elasticity
> of the cloud. This will require the following:
>
> 1. Moving from the dynamic data partitioning described earlier to static
> data partitioning based on a configurable number of storage partitions
> that is fixed for the cluster's lifetime.
> 2. Introducing the concept of a "compute partition", where each NC will
> have a fixed number of compute partitions. This number could potentially
> be based on the number of CPU cores it has.
>
> This will decouple the number of storage partitions being processed on an
> NC from the number of its compute partitions.
>
> When an AsterixDB cluster is deployed using the cloud mode, we will do the
> following:
>
> - The Cluster Controller will maintain a map containing the assignment of
> storage partitions to compute partitions.
> - New writes will be written to the NC's local storage and uploaded to an
> object store (e.g.
> AWS S3), which will be used as a highly available shared filesystem
> between NCs.
> - On queries, each NC will start as many threads as it has compute
> partitions to process its currently assigned storage partitions.
> - On scaling operations, we will simply update the assignment map, and NCs
> will lazily cache any data of newly assigned storage partitions from the
> object store.
>
> Improvement tickets:
> Static data partitioning:
> https://issues.apache.org/jira/browse/ASTERIXDB-3144
> Compute-storage separation:
> https://issues.apache.org/jira/browse/ASTERIXDB-3196
>
> Please vote on this APE. We'll keep this open for 72 hours and pass with
> either 3 votes or a majority of positive votes.
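The scheme quoted above can be sketched in a few lines. This is not AsterixDB code; it is a minimal illustration under assumed names (`NUM_STORAGE_PARTITIONS`, `storage_partition`, `assign` are all hypothetical) of why a fixed storage-partition count makes scaling cheap: a record's storage partition never changes, so scaling only rewrites the Cluster Controller's assignment map, not the data.

```python
# Hypothetical sketch of static partitioning + the storage-to-compute
# assignment map; names and the round-robin policy are assumptions, not
# AsterixDB internals.

NUM_STORAGE_PARTITIONS = 8  # configurable, but fixed for the cluster's lifetime


def storage_partition(primary_key) -> int:
    """Static hash partitioning: a record's storage partition never changes,
    so the data in the object store never needs to be rewritten."""
    return hash(primary_key) % NUM_STORAGE_PARTITIONS


def assign(num_compute_partitions: int) -> dict[int, list[int]]:
    """Cluster Controller's map: compute partition -> its storage partitions,
    dealt out round-robin."""
    assignment = {c: [] for c in range(num_compute_partitions)}
    for s in range(NUM_STORAGE_PARTITIONS):
        assignment[s % num_compute_partitions].append(s)
    return assignment


# With 2 compute partitions, each one processes 4 storage partitions:
before = assign(2)   # -> {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}

# Scaling out to 4 compute partitions only updates the map; the stored data
# is untouched, and each NC lazily caches the partitions it now owns:
after = assign(4)    # -> {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Contrast this with the shared-nothing scheme described in the first paragraph, where adding a node changes the hash modulus itself (`n` to `n + 1`) and therefore forces a full rewrite of the data.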
