This is an automated email from the ASF dual-hosted git repository.

djwang pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/cloudberry-site.git
commit 93aadc6f6fc749658465423b3217785bf750c4ae
Author: Dianjin Wang <[email protected]>
AuthorDate: Fri Nov 29 12:15:55 2024 +0800

    Update the name and URLs in Bootcamp
---
 src/consts/bootcamp.tsx                            |  8 -----
 ...on-to-database-and-cloudberrydb-architecture.md | 42 +++++++++++-----------
 src/pages/bootcamp/101-1-create-users-and-roles.md | 14 ++++----
 .../bootcamp/101-2-create-and-prepare-database.md  | 10 +++---
 src/pages/bootcamp/101-3-create-tables.md          | 10 +++---
 src/pages/bootcamp/101-4-data-loading.md           | 32 ++++++++---------
 .../101-5-queries-and-performance-tuning.md        | 36 +++++++++----------
 .../101-6-backup-and-recovery-operations.md        | 14 ++++----
 src/pages/bootcamp/102-cbdb-crash-course.md        | 28 +++++++--------
 .../103-cbdb-performance-benchmark-tpcds.md        | 12 +++----
 .../103-cbdb-performance-benchmark-tpch.md         | 10 +++---
 ...uction-to-cloudberrydb-in-database-analytics.md | 14 ++++----
 src/pages/bootcamp/104-2-hashml-for-datascience.md |  6 ----
 src/pages/bootcamp/cbdb-sandbox.md                 | 18 +++++-----
 14 files changed, 120 insertions(+), 134 deletions(-)

diff --git a/src/consts/bootcamp.tsx b/src/consts/bootcamp.tsx
index 6a5673ba..03135c2e 100644
--- a/src/consts/bootcamp.tsx
+++ b/src/consts/bootcamp.tsx
@@ -132,14 +132,6 @@ let BOOTCAMP_PAGE_CONFIG = {
           href: "/bootcamp/104-1-introduction-to-cloudberrydb-in-database-analytics",
         },
       },
-      {
-        title: "104-2",
-        style: { width: 474 },
-        link: {
-          text: "HashML for Data Science",
-          href: "/bootcamp/104-2-hashml-for-datascience",
-        },
-      },
     ],
   },
   GET_SOURCE: {
diff --git a/src/pages/bootcamp/101-0-introduction-to-database-and-cloudberrydb-architecture.md b/src/pages/bootcamp/101-0-introduction-to-database-and-cloudberrydb-architecture.md
index d377af79..08737bf1 100644
--- a/src/pages/bootcamp/101-0-introduction-to-database-and-cloudberrydb-architecture.md
+++ b/src/pages/bootcamp/101-0-introduction-to-database-and-cloudberrydb-architecture.md
@@ -1,6 +1,6 @@
 ---
-title: "[101-0] Lesson 0: Introduction to Database and 
CloudberryDB Architecture" -description: This page provides an introduction to the basic concepts of databases and explains the architecture of Cloudberry Database. +title: "[101-0] Lesson 0: Introduction to Database and Apache Cloudberry Architecture" +description: This page provides an introduction to the basic concepts of databases and explains the architecture of Apache Cloudberry. --- ## Background: Database Concepts @@ -23,49 +23,49 @@ The database system needs to maintain some metadata - called the database catalo SQL (Structured Query Language) is a descriptive language, not imperative language. Therefore it describes what the user needs, not how to get it. When the user describes what he needs, the database need to decide how to get it. This process is called query optimization. The end result from this process is a query plan, which is a step by step instruction how to get the result. -## Introduction to the Cloudberry Database Architecture +## Introduction to the Apache Cloudberry Architecture -Cloudberry Database (or CloudberryDB) is a massively parallel processing (MPP) database server with an architecture specially designed to manage large-scale analytic data warehouses and business intelligence workloads. +Apache Cloudberry is a massively parallel processing (MPP) database server with an architecture specially designed to manage large-scale analytic data warehouses and business intelligence workloads. -MPP (also known as a shared-nothing architecture) refers to systems with two or more processors that cooperate to carry out an operation, each processor with its own memory, operating system, and disks. Cloudberry Database uses this high-performance system architecture to distribute the load of multi-terabyte data warehouses and all of a system's resources in parallel to process a query. 
+MPP (also known as a shared-nothing architecture) refers to systems with two or more processors that cooperate to carry out an operation, each processor with its own memory, operating system, and disks. Apache Cloudberry uses this high-performance system architecture to distribute the load of multi-terabyte data warehouses and all of a system's resources in parallel to process a query. -Cloudberry Database is based on the open-source PostgreSQL. It is essentially several PostgreSQL database instances working together as one cohesive database management system (DBMS). It is based on PostgreSQL 14.4 kernel and in most cases it is very similar to PostgreSQL. Database users interact with Cloudberry Database as a regular PostgreSQL DBMS. +Apache Cloudberry is based on the open-source PostgreSQL. It is essentially several PostgreSQL database instances working together as one cohesive database management system (DBMS). It is based on PostgreSQL 14.4 kernel and in most cases it is very similar to PostgreSQL. Database users interact with Apache Cloudberry as a regular PostgreSQL DBMS. -In CloudberryDB, internals of PostgreSQL have been modified and optimized to support parallel structure of Cloudberry Database. For instance, system catalog, optimizer, query executor and transaction manager components have been modified and enhanced to be able to execute queries simultaneously across the parallel PostgreSQL database instances. CloudberryDB interconnect (the networking layer) enables communication between distinct PostgreSQL instances and allows the system to behave as o [...] +In Apache Cloudberry, internals of PostgreSQL have been modified and optimized to support parallel structure of Apache Cloudberry. For instance, system catalog, optimizer, query executor and transaction manager components have been modified and enhanced to be able to execute queries simultaneously across the parallel PostgreSQL database instances. 
Apache Cloudberry interconnect (the networking layer) enables communication between distinct PostgreSQL instances and allows the system to beh [...] -Cloudberry Database also includes features designed to optimize PostgreSQL for business intelligence (BI) workloads. For example, CloudberryDB has added parallel data loading (external tables), resource management, query optimizations and storage enhancements,. +Apache Cloudberry also includes features designed to optimize PostgreSQL for business intelligence (BI) workloads. For example, Apache Cloudberry has added parallel data loading (external tables), resource management, query optimizations and storage enhancements,. -_Figure 1. High-Level Cloudberry Database Architecture_ +_Figure 1. High-Level Apache Cloudberry Architecture_ - + -The following topics describe the components that make up a Cloudberry Database system and how they work together. +The following topics describe the components that make up a Apache Cloudberry system and how they work together. -### CloudberryDB Master (Coordinator) +### Apache Cloudberry Master (Coordinator) :::note -In the latest build of Cloudberry Database, the name "Master" has been deprecated, and "Coordinator" has been used instead. You are expected to see "coordinator" in the database output. +In the latest build of Apache Cloudberry, the name "Master" has been deprecated, and "Coordinator" has been used instead. You are expected to see "coordinator" in the database output. ::: -The Cloudberry Database master is the entry to the Cloudberry Database system. It accepts client connections, handles SQL queries, and then distributes workload to the segment instances. +The Apache Cloudberry master is the entry to the Apache Cloudberry system. It accepts client connections, handles SQL queries, and then distributes workload to the segment instances. -Cloudberry Database end-users only interact with Cloudberry Database through master node as a typical PostgreSQL database. 
They connect to database using client such as psql or drivers like JDBC or ODBC. +Apache Cloudberry end-users only interact with Apache Cloudberry through master node as a typical PostgreSQL database. They connect to database using client such as psql or drivers like JDBC or ODBC. -The master stores global system catalog. Global system catalog is set of system tables that contain metadata for Cloudberry Database itself. Master node does not contain any user table data; user table data resides only on segments. Master node would authenticate client connections, processe incoming SQL commands, distribute workloads among segments, collect the results returned by each segment and return the final results to the client. +The master stores global system catalog. Global system catalog is set of system tables that contain metadata for Apache Cloudberry itself. Master node does not contain any user table data; user table data resides only on segments. Master node would authenticate client connections, processe incoming SQL commands, distribute workloads among segments, collect the results returned by each segment and return the final results to the client. -### CloudberryDB Segments +### Apache Cloudberry Segments -Cloudberry Database segment instances are independent PostgreSQL databases that each of them stores a portion of the data and performs the majority of query execution work. +Apache Cloudberry segment instances are independent PostgreSQL databases that each of them stores a portion of the data and performs the majority of query execution work. When a user connects to the database via the Cloudberry master and issues queries, accordingly execution plan would be distributed to each segment instance. -The server that has segments running on it is called segment host. A segment host usually has two to eight Cloudberry segments running on it, the number depending on serveral factors: CPU cores, memory, disk, network interfaces or workloads. 
To get better performance in Cloudberry Database, it is suggested to distribute data and workloads evenly across segments so that execution plan can be finished across all segments and with no bottleneck. +The server that has segments running on it is called segment host. A segment host usually has two to eight Cloudberry segments running on it, the number depending on serveral factors: CPU cores, memory, disk, network interfaces or workloads. To get better performance in Apache Cloudberry, it is suggested to distribute data and workloads evenly across segments so that execution plan can be finished across all segments and with no bottleneck. -### CloudberryDB Interconnect +### Apache Cloudberry Interconnect -The interconnect is the networking layer of the Cloudberry Database architecture. +The interconnect is the networking layer of the Apache Cloudberry architecture. The interconnect refers to the inter-process communication mechanism in-between segments. By default, interconnect uses User Datagram Protocol (UDP) to send/receive messages over the network. Interconnect provide datagram verification and retransmission mechanism. Reliability is equivalent to Transmission Control Protocol (TCP), performance and scalability exceeds TCP. If a user chooses TCP in interconnect, Cloudberry would have limit around 1000 segment instances. With UDP and interconn [...] diff --git a/src/pages/bootcamp/101-1-create-users-and-roles.md b/src/pages/bootcamp/101-1-create-users-and-roles.md index 4ddb90b3..8277f32a 100644 --- a/src/pages/bootcamp/101-1-create-users-and-roles.md +++ b/src/pages/bootcamp/101-1-create-users-and-roles.md @@ -1,9 +1,9 @@ --- title: "[101-1] Lesson 1: Create Users and Roles" -description: Learn how to create users and roles in the Cloudberry Database with this helpful introduction. +description: Learn how to create users and roles in the Apache Cloudberry with this helpful introduction. --- -Cloudberry Database manages database access using roles. 
Initially, there is one superuser role, the role associated with the OS user who initialized the database instance, usually `gpadmin`. This user owns all of the Cloudberry Database files and OS processes, so it is important to reserve the `gpadmin` role for system tasks only. +Apache Cloudberry manages database access using roles. Initially, there is one superuser role, the role associated with the OS user who initialized the database instance, usually `gpadmin`. This user owns all of the Apache Cloudberry files and OS processes, so it is important to reserve the `gpadmin` role for system tasks only. A role can be a user or a group. A user role can log into a database; that is, it has the `LOGIN` attribute. A user or group role can become a member of a group. @@ -13,11 +13,11 @@ Permissions can be granted to users or groups. Initially, only the `gpadmin` rol You can follow the examples below to create users and roles. -Before moving on to the operations, make sure that you have installed Cloudberry Database by following [Install a Single-Node Cloudberry Database](./cbdb-sandbox). +Before moving on to the operations, make sure that you have installed Apache Cloudberry by following [Install a Single-Node Apache Cloudberry](./cbdb-sandbox). ### Create a user using the CREATE USER command -1. Log into Cloudberry Database in Docker. Connect to the database as the `gpadmin` user. +1. Log into Apache Cloudberry in Docker. Connect to the database as the `gpadmin` user. ```shell [gpadmin@mdw ~]$ psql @@ -152,7 +152,7 @@ Before moving on to the operations, make sure that you have installed Cloudberry users | Cannot login | {} ``` -However, after creating the `users` group, `lily` and `lucy` cannot log into Cloudberry Database yet. See the following error messages. +However, after creating the `users` group, `lily` and `lucy` cannot log into Apache Cloudberry yet. See the following error messages. 
```shell [gpadmin@mdw ~]$ psql -U lily -d gpadmin @@ -181,7 +181,7 @@ To make users (`lily` and `lucy`) able to log into the database, you need to adj > **Info:** > - > - `pg_hba.conf` is a configuration file in Cloudberry Database to control access permissions. + > - `pg_hba.conf` is a configuration file in Apache Cloudberry to control access permissions. > - `md5` and `trust` are the authentication methods. `md5` means that the user needs to enter the password to log in. `trust` means that the user can log in without entering the password. 2. Use `gpstop` to populate the change. @@ -195,7 +195,7 @@ To make users (`lily` and `lucy`) able to log into the database, you need to adj 20230818:14:16:05:003653 gpstop:mdw:gpadmin-[INFO]:-Gathering information and validating the environment... 20230818:14:16:05:003653 gpstop:mdw:gpadmin-[INFO]:-Obtaining Cloudberry Coordinator catalog information 20230818:14:16:05:003653 gpstop:mdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator... - 20230818:14:16:05:003653 gpstop:mdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Cloudberry Database) 1.0.0 build dev' + 20230818:14:16:05:003653 gpstop:mdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Apache Cloudberry) 1.0.0 build dev' 20230818:14:16:05:003653 gpstop:mdw:gpadmin-[INFO]:-Signalling all postmaster processes to reload ``` diff --git a/src/pages/bootcamp/101-2-create-and-prepare-database.md b/src/pages/bootcamp/101-2-create-and-prepare-database.md index cef57283..92a02e2f 100644 --- a/src/pages/bootcamp/101-2-create-and-prepare-database.md +++ b/src/pages/bootcamp/101-2-create-and-prepare-database.md @@ -1,9 +1,9 @@ --- title: "[101-2] Lesson 2: Create and Prepare Database" -description: Let's create one new database in the Cloudberry Database. +description: Let's create one new database in the Apache Cloudberry. 
--- -To create a new database in Cloudberry Database, you can either use the `CREATE DATABASE` SQL command in the `psql` client, or use the `createdb` utility. The `createdb` utility is a wrapper around the `CREATE DATABASE` command. +To create a new database in Apache Cloudberry, you can either use the `CREATE DATABASE` SQL command in the `psql` client, or use the `createdb` utility. The `createdb` utility is a wrapper around the `CREATE DATABASE` command. ## Quick-start operations @@ -13,7 +13,7 @@ Before moving on to the operations, make sure that you have completed the previo ### Create database -1. Log into Cloudberry Database in Docker. Before creating the `tutorial` database, make sure that this database does not exist. +1. Log into Apache Cloudberry in Docker. Before creating the `tutorial` database, make sure that this database does not exist. ```shell [gpadmin@mdw ~]$ dropdb tutorial @@ -58,7 +58,7 @@ Before moving on to the operations, make sure that you have completed the previo > **Info:** > - > - `pg_hba.conf` is the configuration file for client access control in Cloudberry Database. + > - `pg_hba.conf` is the configuration file for client access control in Apache Cloudberry. > - `md5` is the authentication methods, which means that the user needs to enter the password to log in. @@ -73,7 +73,7 @@ Before moving on to the operations, make sure that you have completed the previo 20230818:14:18:45:003733 gpstop:mdw:gpadmin-[INFO]:-Gathering information and validating the environment... 20230818:14:18:45:003733 gpstop:mdw:gpadmin-[INFO]:-Obtaining Cloudberry Coordinator catalog information 20230818:14:18:45:003733 gpstop:mdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator... 
- 20230818:14:18:45:003733 gpstop:mdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Cloudberry Database) 1.0.0 build dev' + 20230818:14:18:45:003733 gpstop:mdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Apache Cloudberry) 1.0.0 build dev' 20230818:14:18:45:003733 gpstop:mdw:gpadmin-[INFO]:-Signalling all postmaster processes to reload ``` diff --git a/src/pages/bootcamp/101-3-create-tables.md b/src/pages/bootcamp/101-3-create-tables.md index a71d46f5..0ee97cb9 100644 --- a/src/pages/bootcamp/101-3-create-tables.md +++ b/src/pages/bootcamp/101-3-create-tables.md @@ -1,23 +1,23 @@ --- title: "[101-3] Lesson 3: Create Tables" -description: Learn how to create tables in the Cloudberry Database. +description: Learn how to create tables in the Apache Cloudberry. --- After creating and preparing a database in [Lesson 2: Create and Prepare a Database](./101-2-create-and-prepare-database), you can start to create tables in the database. :::note -To introduce Cloudberry Database, we use a public data set, the Airline On-Time Statistics and Delay Causes data set, published by the United States Department of Transportation at http://www.transtats.bts.gov/. The On-Time Performance dataset records flights by date, airline, originating airport, destination airport, and many other flight details. The data is available for flights since 1987. The exercises in this guide use data for about a million flights in 2009 and 2010. You are enco [...] +To introduce Apache Cloudberry, we use a public data set, the Airline On-Time Statistics and Delay Causes data set, published by the United States Department of Transportation at http://www.transtats.bts.gov/. The On-Time Performance dataset records flights by date, airline, originating airport, destination airport, and many other flight details. The data is available for flights since 1987. The exercises in this guide use data for about a million flights in 2009 and 2010. You are encour [...] 
::: ## Create tables using a SQL file in psql -In Cloudberry Database, you can use the `CREATE TABLE` SQL statement to create a table. +In Apache Cloudberry, you can use the `CREATE TABLE` SQL statement to create a table. In the following steps, you will be guided to run a SQL file `create_dim_tables.sql` that contains the `CREATE TABLE` statements needed to create `faa` databases. -1. Log into Cloudberry Database in Docker as `gpadmin`. Then enter the `faa` directory, in which the SQL file `create_dim_tables.sql` is located. +1. Log into Apache Cloudberry in Docker as `gpadmin`. Then enter the `faa` directory, in which the SQL file `create_dim_tables.sql` is located. ```shell [gpadmin@mdw tmp]$ cd /tmp @@ -140,7 +140,7 @@ The definition of a table includes the distribution policy for the data, which i The distribution policy determines how data is distributed among segments. To get an effective distribution policy requires understanding of the data's characteristics, what kind of queries that would be executed on the data and what distribution strategy will best utilize the parallel execution capacity among segments. -Use the `DISTRIBUTED` clause in `CREATE TABLE` statement to define the distribution policy for a table. Ideally, each segment possesses an equal volume of data and performs equal share of work when queries run. There are 2 kinds of distribution policy syntax in Cloudberry Database: +Use the `DISTRIBUTED` clause in `CREATE TABLE` statement to define the distribution policy for a table. Ideally, each segment possesses an equal volume of data and performs equal share of work when queries run. There are 2 kinds of distribution policy syntax in Apache Cloudberry: - `DISTRIBUTED BY (column, ...)` defines a distribution key from one or more columns. A hash function applied to the distribution key determines which segment stores the corresponding row. Rows that have same distribution key are stored on the same segment. 
If the distribution keys are unique, the hash function will ensure that data is distributed evenly. The default distribution policy is a hash on the primary key of the table or the first column of table if no primary key is specified. - `DISTRIBUTED RANDOMLY` distributes rows in round-robin fashion among segments. diff --git a/src/pages/bootcamp/101-4-data-loading.md b/src/pages/bootcamp/101-4-data-loading.md index c00f2e59..d81c81af 100644 --- a/src/pages/bootcamp/101-4-data-loading.md +++ b/src/pages/bootcamp/101-4-data-loading.md @@ -1,19 +1,19 @@ --- title: "[101-4] Lesson 4: Data Loading" -description: Load your data to the Cloudberry Database. +description: Load your data to the Apache Cloudberry. --- -This tutorial briefly introduces 3 methods to load the example data `FAA` into Cloudberry Database tables you have created in the previous tutorial [Lesson 3: Create Tables](./101-3-create-tables). Before continuing, make sure you have completed the previous tutorial. +This tutorial briefly introduces 3 methods to load the example data `FAA` into Apache Cloudberry tables you have created in the previous tutorial [Lesson 3: Create Tables](./101-3-create-tables). Before continuing, make sure you have completed the previous tutorial. - Method 1: Use the `INSERT` statement. This is the easiest way to load data. You can execute `INSERT` directly in psql, run scripts that have `INSERT` statements, or run a client-side application with database connection. It is not recommended to use `INSERT` to load a large amount of data, because the loading efficiency is low. - Method 2: Use the SQL statement `COPY` to load data into database. The `COPY` syntax allows you to define the format of the text file so that data can be parsed into rows and columns. This method is faster than the `INSERT` statement. But, like `INSERT` statement, `COPY` is not a parallel data loading process. 
- The `COPY` statement requires that external files be accessible to the host where the master process is running. On a multi-node Cloudberry Database system, data files might reside on a file system that is not accessible from master node. In this case, you need to use the psql command `\copy meta-command` that streams data to Cloudberry master node over `psql` connection. Some example scripts in this tutorial use the `\copy meta-command`. + The `COPY` statement requires that external files be accessible to the host where the master process is running. On a multi-node Apache Cloudberry system, data files might reside on a file system that is not accessible from master node. In this case, you need to use the psql command `\copy meta-command` that streams data to Cloudberry master node over `psql` connection. Some example scripts in this tutorial use the `\copy meta-command`. -- Method 3: Use Cloudberry Database utilities to load external data into tables. When you are working with a large-scale data warehouse, you might often face the challenge of loading large amounts of data in a short time. The utilities, `gpfdist` and `gpload`, are tailored for this purpose, enabling you to achieve rapid, parallel data transfers. +- Method 3: Use Apache Cloudberry utilities to load external data into tables. When you are working with a large-scale data warehouse, you might often face the challenge of loading large amounts of data in a short time. The utilities, `gpfdist` and `gpload`, are tailored for this purpose, enabling you to achieve rapid, parallel data transfers. - During your data loading process, if any rows run into issues, they will be noted. You can set an error threshold that fits your needs. If the number of problematic rows exceeds this limit, Cloudberry Database will stop the loading process. + During your data loading process, if any rows run into issues, they will be noted. You can set an error threshold that fits your needs. 
If the number of problematic rows exceeds this limit, Apache Cloudberry will stop the loading process. For optimal speed, combine the use of external tables with the parallel file server (`gpfdist`). This approach will help you maximize efficiency, making your data loading tasks smoother and more efficient. @@ -34,7 +34,7 @@ The `faa.d_cancellation_codes` table is a simple 2-column look-up table. You wil <!-- You change to directory faa, containing FAA data and scripts, take a look at table faa.d_cancellation_codes, insert data into table. --> -1. Log into Cloudberry Database in Docker as `gpadmin`, and change to the `faa` directory. This directory contains `faa` data and scripts. +1. Log into Apache Cloudberry in Docker as `gpadmin`, and change to the `faa` directory. This directory contains `faa` data and scripts. ```shell [gpadmin@mdw ~]$ cd /tmp/faa @@ -152,11 +152,11 @@ The `COPY` statement moves data from the file system to database tables. Data fo For the `faa` fact table, you will use an ETL (Extract, Transform, Load) process to load data from the source gzip files into a data table. For the best loading speed, use the `gpfdist` utility to distribute rows to segments. -In production system, `gpfdist` runs on file servers that external data resides. However, for a single-node Cloudberry Database instance, there is only one logical host, so you run `gpfdist` on it as well. Starting `gpfdist` is similar as a file server, no data movement will occur until SQL query request has been ended. +In production system, `gpfdist` runs on file servers that external data resides. However, for a single-node Apache Cloudberry instance, there is only one logical host, so you run `gpfdist` on it as well. Starting `gpfdist` is similar as a file server, no data movement will occur until SQL query request has been ended. > **Note:** > -> This exercise loads data using `gpfdsit` to move data from external data files into Cloudberry Database. 
Moving data between the database and external tables also needs security request. Therefore, only superusers are permitted to use `gpfdsit` and you will complete this exercise as `gpadmin` user. +> This exercise loads data using `gpfdsit` to move data from external data files into Apache Cloudberry. Moving data between the database and external tables also needs security request. Therefore, only superusers are permitted to use `gpfdsit` and you will complete this exercise as `gpadmin` user. 1. Start `gpfdist`: @@ -239,7 +239,7 @@ The following operations are performed in this section: tutorial=# INSERT INTO faa.faa_otp_load SELECT * FROM faa.ext_load_otp; ``` - Note: Cloudberry Database facilitates moving data from the gzip files into the database's load table. In a production setting, there might be several `gpfdist` processes running, either on separate hosts or multiple on one host, each using a different port. + Note: Apache Cloudberry facilitates moving data from the gzip files into the database's load table. In a production setting, there might be several `gpfdist` processes running, either on separate hosts or multiple on one host, each using a different port. 3. Examine load errors. @@ -262,7 +262,7 @@ The following operations are performed in this section: ### Load data using the `gpload` utility -Cloudberry Database provides a wrapper program for `gpfdist` called `gpload` that does much of the work to set up external table and data movement. In this exercise, you will reload the `faa_otp_load` table using the gpload utility. +Apache Cloudberry provides a wrapper program for `gpfdist` called `gpload` that does much of the work to set up external table and data movement. In this exercise, you will reload the `faa_otp_load` table using the gpload utility. In this section, we walk through the process of loading data with `gpload`. The steps are: @@ -321,7 +321,7 @@ In this section, we walk through the process of loading data with `gpload`. 
The [gpadmin@mdw faa]$ gpload -f gpload.yaml -l gpload.log ``` - Summary: At the end of this guide, you would have successfully used gpload to load data into CloudberryDB. Make sure to check the logs for any warnings or errors to ensure data consistency and integrity. + Summary: At the end of this guide, you would have successfully used gpload to load data into Apache Cloudberry. Make sure to check the logs for any warnings or errors to ensure data consistency and integrity. ### Create and load fact tables @@ -351,14 +351,14 @@ tutorial=# - Key Feature: rapid data loading - - Extract, load, and t ransform (ELT): This method takes advantage of the massive parallelism of Cloudberry Database. + - Extract, load, and t ransform (ELT): This method takes advantage of the massive parallelism of Apache Cloudberry. - Staging: Data can be staged using methods like external tables. - - Transformation: Data transformations occur within the Cloudberry Database. + - Transformation: Data transformations occur within the Apache Cloudberry. - Performance: Set-based operations are done in parallel to maximize efficiency. - Loading mechanisms - - `COPY`: Loads data via the master in a single process, but doesn't harness CloudberryDB's parallel capabilities. + - `COPY`: Loads data via the master in a single process, but doesn't harness Apache Cloudberry's parallel capabilities. - External tables: - Advantage: Takes advantage of the parallel processing power of segments. - Flexibility: One `SELECT` statement can access multiple data sources. @@ -379,11 +379,11 @@ tutorial=# - There is a risk of data duplication, especially when extracting data from another database. - Users need to be cautious and verify data when using Web tables. -Understanding and using these features and mechanisms effectively can ensure optimal data loading and management within the Cloudberry Database. 
+Understanding and using these features and mechanisms effectively can ensure optimal data loading and management within the Apache Cloudberry. ## What's next -In this tutorial, you learned how to load data into Cloudberry Database. You learned about the different loading mechanisms and how to use them. You also learned how to use the `gpload` utility to load data. Finally, you learned how to create and load fact tables. You can now move on to the next tutorial, [Lesson 5: Queries and Performance Tuning](./101-5-queries-and-performance-tuning), to learn about query performance tuning in Cloudberry Database. +In this tutorial, you learned how to load data into Apache Cloudberry. You learned about the different loading mechanisms and how to use them. You also learned how to use the `gpload` utility to load data. Finally, you learned how to create and load fact tables. You can now move on to the next tutorial, [Lesson 5: Queries and Performance Tuning](./101-5-queries-and-performance-tuning), to learn about query performance tuning in Apache Cloudberry. Other tutorials: diff --git a/src/pages/bootcamp/101-5-queries-and-performance-tuning.md b/src/pages/bootcamp/101-5-queries-and-performance-tuning.md index e85ce71a..2a9f05cd 100644 --- a/src/pages/bootcamp/101-5-queries-and-performance-tuning.md +++ b/src/pages/bootcamp/101-5-queries-and-performance-tuning.md @@ -1,13 +1,13 @@ --- title: "[101-5] Lesson 5: Queries and Performance Tuning" -description: Understand the queries in the Cloudberry Database. +description: Understand the queries in the Apache Cloudberry. --- -This lesson provides an overview of how Cloudberry Database processes queries. Understanding this process can be useful when you write and tune queries. +This lesson provides an overview of how Apache Cloudberry processes queries. Understanding this process can be useful when you write and tune queries. 
## Concepts -Users submit queries to Cloudberry Database as they would to any database management system. They connect to the database instance on the CloudberryDB master host using a client application such as psql and submit SQL statements. +Users submit queries to Apache Cloudberry as they would to any database management system. They connect to the database instance on the Apache Cloudberry master host using a client application such as psql and submit SQL statements. ### Understand query planning and dispatch @@ -21,21 +21,21 @@ _Figure 1. Dispatch the parallel query plan_ ### Understand query plans -A query plan is a set of operations Cloudberry Database will perform to produce the answer to a query. Each node or step in the plan represents a database operation such as a table scan, join, aggregation or sort. Plans are read and executed from bottom to top. +A query plan is a set of operations Apache Cloudberry will perform to produce the answer to a query. Each node or step in the plan represents a database operation such as a table scan, join, aggregation or sort. Plans are read and executed from bottom to top. -In addition to common database operations such as tables scan and join, Cloudberry Database has an additional operation type called "motion". A motion operation involves moving tuples between segments during query processing. +In addition to common database operations such as tables scan and join, Apache Cloudberry has an additional operation type called "motion". A motion operation involves moving tuples between segments during query processing. -To achieve maximum parallelism during query execution, Cloudberry Database divides the work of a query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan with one slice on each side of the motion. 
+To achieve maximum parallelism during query execution, Apache Cloudberry divides the work of a query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan with one slice on each side of the motion. ### Understand parallel query execution -Cloudberry Database creates a number of database processes to handle the work of a query. On the master, the query worker process is called "query dispatcher" or "QD". QD is responsible for creating and dispatching query plan. It also accumulates and presents the final results. On segments, a query worker process is called "query executor" or "QE". QE is responsible for completing its portion of work and communicating its intermediate results to other worker processes. +Apache Cloudberry creates a number of database processes to handle the work of a query. On the master, the query worker process is called "query dispatcher" or "QD". QD is responsible for creating and dispatching query plan. It also accumulates and presents the final results. On segments, a query worker process is called "query executor" or "QE". QE is responsible for completing its portion of work and communicating its intermediate results to other worker processes. There is at least one worker process assigned to each slice of the query plan. A worker process works on its assigned portion of the query plan independently. During query execution, each segment will have a number of processes working on the query in parallel. -Related processes that are working on the same slice of the query plan but on different segments are called "gangs". As a portion of work is completed, tuples flow up the query plan from one gang of processes to the next. This inter-process communication between segments is referred to as the interconnect component of Cloudberry Database. 
+Related processes that are working on the same slice of the query plan but on different segments are called "gangs". As a portion of work is completed, tuples flow up the query plan from one gang of processes to the next. This inter-process communication between segments is referred to as the interconnect component of Apache Cloudberry. -The following section introduces some of the basic principles of query and performance tuning in a Cloudberry database. +The following section introduces some of the basic principles of query and performance tuning in a Apache Cloudberry. Some items to consider in performance tuning: @@ -52,11 +52,11 @@ After doing the following exercises, you are expected to finish the previous tut ### Analyze the tables -Cloudberry Database uses Multi-version Concurrency Control (MVCC) to guarantee data isolation, one of the ACID properties of relational databases. MVCC allows multiple users of the database to obtain consistent results for a query, even if the data is changing as the query is being executed. There can be multiple versions of rows in the database, but a query sees a snapshot of the database at a single point in time, containing only the versions of rows that are valid at that point in tim [...] +Apache Cloudberry uses Multi-version Concurrency Control (MVCC) to guarantee data isolation, one of the ACID properties of relational databases. MVCC allows multiple users of the database to obtain consistent results for a query, even if the data is changing as the query is being executed. There can be multiple versions of rows in the database, but a query sees a snapshot of the database at a single point in time, containing only the versions of rows that are valid at that point in time. [...] -In a Cloudberry Database, regular OLTP operations do not create the need for vacuuming out old rows, but loading data while tables are in use might create such a need. It is a best practice to `VACUUM` a table after a load. 
If the table is partitioned, and only a single partition is being altered, then a `VACUUM` on that partition might suffice. +In a Apache Cloudberry, regular OLTP operations do not create the need for vacuuming out old rows, but loading data while tables are in use might create such a need. It is a best practice to `VACUUM` a table after a load. If the table is partitioned, and only a single partition is being altered, then a `VACUUM` on that partition might suffice. -The `VACUUM FULL` command behaves much differently than `VACUUM`, and its use is not recommended in Cloudberry databases. It can be expensive in CPU and I/O consumption, cause bloat in indexes, and lock data for long periods of time. +The `VACUUM FULL` command behaves much differently than `VACUUM`, and its use is not recommended in Apache Cloudberry. It can be expensive in CPU and I/O consumption, cause bloat in indexes, and lock data for long periods of time. The ANALYZE command generates statistics about the distribution of data in a table. In particular, it stores histograms about the values in each of the columns. The query optimizer depends on these statistics to select the best plan for executing a query. For example, the optimizer can use distribution data to decide on join orders. One of the optimizer's goals in a join is to minimize the volume of data that must be analyzed and potentially moved between segments by using the statistics [...] @@ -210,13 +210,13 @@ By default, the sandbox instance disables the Pivotal Query Optimizer and you mi 20230726:14:42:49:031465 gpstop:mdw:gpadmin-[INFO]:-Gathering information and validating the environment... 20230726:14:42:49:031465 gpstop:mdw:gpadmin-[INFO]:-Obtaining Cloudberry Coordinator catalog information 20230726:14:42:49:031465 gpstop:mdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator... 
- 20230726:14:42:49:031465 gpstop:mdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Cloudberry Database) 1.0.0 build dev' + 20230726:14:42:49:031465 gpstop:mdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Apache Cloudberry) 1.0.0 build dev' 20230726:14:42:49:031465 gpstop:mdw:gpadmin-[INFO]:-Signalling all postmaster processes to reload ``` ### Indexes and performance -Cloudberry Database does not depend upon indexes to the same degree as traditional data warehouse systems. Because the segments execute table scans in parallel, each segment scanning a small part of the table, the traditional performance advantage from indexes is gone. Indexes consume large amounts of space and require considerable CPU time slot to compute during data loads. There are, however, times when indexes are useful, especially for highly selective queries. When a query looks up [...] +Apache Cloudberry does not depend upon indexes to the same degree as traditional data warehouse systems. Because the segments execute table scans in parallel, each segment scanning a small part of the table, the traditional performance advantage from indexes is gone. Indexes consume large amounts of space and require considerable CPU time slot to compute during data loads. There are, however, times when indexes are useful, especially for highly selective queries. When a query looks up a [...] In this exercise, you work with the legacy optimizer to know how index can improve performance. You first run a single row lookup on the sample table without an index, then rerun the query after creating an index. @@ -282,7 +282,7 @@ tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345 OR big = 12355; ### Row vs. column orientation -Cloudberry Database offers the ability to store a table in either row or column orientation. Both storage options have advantages, depending upon data compression characteristics, the kinds of queries executed, the row length, and the complexity, and the number of join columns. 
+Apache Cloudberry offers the ability to store a table in either row or column orientation. Both storage options have advantages, depending upon data compression characteristics, the kinds of queries executed, the row length, and the complexity, and the number of join columns. As a general rule, very wide tables are better stored in row orientation, especially if there are joins on many columns. Column orientation works well to save space with compression and to reduce I/O when there is much duplicated data in columns. @@ -573,17 +573,17 @@ Partitions can improve query performance dramatically. When a query predicate fi A common application for partitioning is to maintain a rolling window of data based on date, for example, a fact table containing the most recent 12 months of data. Using the `ALTER TABLE` statement, an existing partition can be dropped by removing its child file. This is much more efficient than scanning the entire table and removing rows with a `DELETE` statement. -Partitions might also be sub-partitioned. For example, a table can be partitioned by month, and the month partitions can be sub-partitioned by week. Cloudberry Database creates child files for the months and weeks. The actual data, however, is stored in the child files created for the week subpartitions. Only child files at the leaf level hold data. +Partitions might also be sub-partitioned. For example, a table can be partitioned by month, and the month partitions can be sub-partitioned by week. Apache Cloudberry creates child files for the months and weeks. The actual data, however, is stored in the child files created for the week subpartitions. Only child files at the leaf level hold data. When a new partition is added, you can run `ANALYZE` on just the data in that partition. `ANALYZE` can run on the root partition (the name of the table in the `CREATE TABLE` statement) or on a child file created for a leaf partition. 
If `ANALYZE` has already run on the other partitions and the data is static, it is not necessary to run it again on those partitions. -Cloudberry Database supports: +Apache Cloudberry supports: - Range partitioning: division of data based on a numerical range, such as date or price. - List partitioning: division of data based on a list of values, such as sales territory or product line. - A combination of both types. - + The following exercise compares `SELECT` statements with `WHERE` clauses that do and do not use a partitioned column. diff --git a/src/pages/bootcamp/101-6-backup-and-recovery-operations.md b/src/pages/bootcamp/101-6-backup-and-recovery-operations.md index caa07d34..5c2ac3f0 100644 --- a/src/pages/bootcamp/101-6-backup-and-recovery-operations.md +++ b/src/pages/bootcamp/101-6-backup-and-recovery-operations.md @@ -1,23 +1,23 @@ --- title: "[101-6] Lesson 6: Backup and Restore Operations" -description: Learn how to backup and restore your data in the Cloudberry Database. +description: Learn how to backup and restore your data in the Apache Cloudberry. --- :::info -The Cloudberry Database does not include the utility `gpbackup` by default. It's maintained separately. Please follow the [README](https://github.com/cloudberrydb/gpbackup) to install `gpbackup` before using it. +The Apache Cloudberry does not include the utility `gpbackup` by default. It's maintained separately. Please follow the [README](https://github.com/apache/cloudberry-gpbackup) to install `gpbackup` before using it. ::: -The parallel dump utility `gpbackup` backs up the CloudberryDB master instance and each active segment instance at the same time. +The parallel dump utility `gpbackup` backs up the Apache Cloudberry master instance and each active segment instance at the same time. By default, gpbackup creates dump files in the backups subdirectory. 
-Several dump files are created for the master, containing database information such as DDL statements, the CloudberryDB system catalog tables, and metadata files. gpbackup creates dump files for each segment. +Several dump files are created for the master, containing database information such as DDL statements, the Apache Cloudberry system catalog tables, and metadata files. gpbackup creates dump files for each segment. You can perform full or incremental backups. To restore a database to its state when an incremental backup was made, it will restore the previous full backup and all subsequent incremental backups. - + Each file created for a backup begins with a 14-digit timestamp key that identifies the backup set the file belongs to. @@ -25,9 +25,9 @@ gpbackup can be run directly in a terminal on the master host, or you can add it The parallel restore utility `gprestore` takes the timestamp key generated by gpbackup, validates the backup set, and restores the database objects and data into a distributed database in parallel. Parallel restore operations require a complete backup set created by gpbackup, a full backup and any required incremental backups. - + -The gpbackup utility provides flexibility and verification options for use with the automated backup files produced by gpbackup or with backup files moved from the CloudberryDB array to an alternate location. +The gpbackup utility provides flexibility and verification options for use with the automated backup files produced by gpbackup or with backup files moved from the Apache Cloudberry array to an alternate location. 
## Exercises diff --git a/src/pages/bootcamp/102-cbdb-crash-course.md b/src/pages/bootcamp/102-cbdb-crash-course.md index d9fe333a..65edf3d5 100644 --- a/src/pages/bootcamp/102-cbdb-crash-course.md +++ b/src/pages/bootcamp/102-cbdb-crash-course.md @@ -1,9 +1,9 @@ --- -title: "[102] Cloudberry Database Crash Course" -description: If you want to learn the Cloudberry Database quickly, follow this crash course. +title: "[102] Apache Cloudberry Crash Course" +description: If you want to learn the Apache Cloudberry quickly, follow this crash course. --- -This crash course provides an extensive overview of Cloudberry Database, an open-source Massively Parallel Processing (MPP) database. It covers key concepts, features, utilities, and hands-on exercises to become proficient with CBDB. +This crash course provides an extensive overview of Apache Cloudberry, an open-source Massively Parallel Processing (MPP) database. It covers key concepts, features, utilities, and hands-on exercises to become proficient with CBDB. Topics include: @@ -33,18 +33,18 @@ Topics include: ## Lesson 0. Prerequisite -Before starting this crash course, spend some time going through the [Cloudberry Database Tutorials Based on Single-Node Installation](./#1-cloudberrydb-sandbox) to get familiar with Cloudberry Database and how it works. +Before starting this crash course, spend some time going through the [Apache Cloudberry Tutorials Based on Single-Node Installation](./#1-cloudberry-sandbox) to get familiar with Apache Cloudberry and how it works. ## Lesson 1. Where to read the official documentation -Take a quick look at the official [CBDB Documentation](https://cloudberrydb.org/docs). No need to worry if you do not understand everything. +Take a quick look at the official [Cloudberry Documentation](https://cloudberry.apache.org/docs). No need to worry if you do not understand everything. ## Lesson 2. 
How to install CBDB To begin your journey with CBDB, you are expected to install CBDB in your preferred environment. The following options are available: -- For testing or trying out CBDB in a sandbox environment, see [Install CBDB in a Sandbox](./cbdb-sandbox). -- For deploying CBDB in other environments (including the production environment) and the prerequisite software/hardware configuration, see [CBDB Deployment Guide](https://cloudberrydb.org/docs/cbdb-op-deploy-guide). +- For testing or trying out CBDB in a sandbox environment, see [Install Cloudberry in a Sandbox](./cbdb-sandbox). +- For deploying CBDB in other environments (including the production environment) and the prerequisite software/hardware configuration, see [Cloudberry Deployment Guide](https://cloudberry.apache.org/docs/cbdb-op-deploy-guide). ## Lesson 3. Cluster architecture @@ -138,7 +138,7 @@ Read the help information for these tools (`<tool_name> --help`). ```shell 20230823:16:14:23:004256 gpstart:mdw:gpadmin-[INFO]:-Starting gpstart with args: -a 20230823:16:14:23:004256 gpstart:mdw:gpadmin-[INFO]:-Gathering information and validating the environment... - 20230823:16:14:23:004256 gpstart:mdw:gpadmin-[INFO]:-Cloudberry Binary Version: 'postgres (Cloudberry Database) 1.0.0 build dev' + 20230823:16:14:23:004256 gpstart:mdw:gpadmin-[INFO]:-Cloudberry Binary Version: 'postgres (Apache Cloudberry) 1.0.0 build dev' 20230823:16:14:23:004256 gpstart:mdw:gpadmin-[INFO]:-Cloudberry Catalog Version: '302206171' 20230823:16:14:23:004256 gpstart:mdw:gpadmin-[INFO]:-Starting Coordinator instance in admin mode 20230823:16:14:23:004256 gpstart:mdw:gpadmin-[INFO]:-CoordinatorStart pg_ctl cmd is env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /data0/database/master/gpseg-1 -l /data0/database/master/gpseg-1/log/startup.log -w -t 600 -o " -p 5432 -c gp_role=utility " start @@ -176,7 +176,7 @@ Read the help information for these tools (`<tool_name> --help`). 
20230823:16:14:18:004143 gpstop:mdw:gpadmin-[INFO]:-Gathering information and validating the environment... 20230823:16:14:18:004143 gpstop:mdw:gpadmin-[INFO]:-Obtaining Cloudberry Coordinator catalog information 20230823:16:14:18:004143 gpstop:mdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator... - 20230823:16:14:18:004143 gpstop:mdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Cloudberry Database) 1.0.0 build dev' + 20230823:16:14:18:004143 gpstop:mdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Apache Cloudberry) 1.0.0 build dev' 20230823:16:14:18:004143 gpstop:mdw:gpadmin-[INFO]:-Commencing Coordinator instance shutdown with mode='smart' 20230823:16:14:18:004143 gpstop:mdw:gpadmin-[INFO]:-Coordinator segment instance directory=/data0/database/master/gpseg-1 20230823:16:14:18:004143 gpstop:mdw:gpadmin-[INFO]:-Stopping coordinator segment and waiting for user connections to finish ... @@ -215,8 +215,8 @@ Read the log entries for `gpstop` and `gpstart`, and try to understand what they ```shell 20230823:16:17:41:004530 gpstate:mdw:gpadmin-[INFO]:-Starting gpstate with args: - 20230823:16:17:41:004530 gpstate:mdw:gpadmin-[INFO]:-local Cloudberry Version: 'postgres (Cloudberry Database) 1.0.0 build dev' - 20230823:16:17:41:004530 gpstate:mdw:gpadmin-[INFO]:-coordinator Cloudberry Version: 'PostgreSQL 14.4 (Cloudberry Database 1.0.0 build dev) on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Aug 9 2023 14:45:43' + 20230823:16:17:41:004530 gpstate:mdw:gpadmin-[INFO]:-local Cloudberry Version: 'postgres (Apache Cloudberry) 1.0.0 build dev' + 20230823:16:17:41:004530 gpstate:mdw:gpadmin-[INFO]:-coordinator Cloudberry Version: 'PostgreSQL 14.4 (Apache Cloudberry 1.0.0 build dev) on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Aug 9 2023 14:45:43' 20230823:16:17:41:004530 gpstate:mdw:gpadmin-[INFO]:-Obtaining Segment details from 
coordinator... 20230823:16:17:41:004530 gpstate:mdw:gpadmin-[INFO]:-Gathering data from segments... 20230823:16:17:41:004530 gpstate:mdw:gpadmin-[INFO]:-Cloudberry instance status summary @@ -267,7 +267,7 @@ Check the cluster state and try to collect the information using `gpstate` or `g **CBDB mirroring overview:** -Each segment instance in a Cloudberry Database has 2 possible roles: primary and mirror. +Each segment instance in a Apache Cloudberry has 2 possible roles: primary and mirror. - Primary role: serves user queries. - Mirror role: tracks and records data changes from the primary using WAL replication but does not serve user queries. @@ -323,8 +323,8 @@ If your CBDB cluster was initially created without mirrors, you can use the `gpa ```shell 20230823:16:02:50:003517 gpaddmirrors:mdw:gpadmin-[INFO]:-Starting gpaddmirrors with args: -20230823:16:02:50:003517 gpaddmirrors:mdw:gpadmin-[INFO]:-local Cloudberry Version: 'postgres (Cloudberry Database) 1.0.0 build dev' -20230823:16:02:50:003517 gpaddmirrors:mdw:gpadmin-[INFO]:-coordinator Cloudberry Version: 'PostgreSQL 14.4 (Cloudberry Database 1.0.0 build dev) on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Aug 9 2023 14:45:43' +20230823:16:02:50:003517 gpaddmirrors:mdw:gpadmin-[INFO]:-local Cloudberry Version: 'postgres (Apache Cloudberry) 1.0.0 build dev' +20230823:16:02:50:003517 gpaddmirrors:mdw:gpadmin-[INFO]:-coordinator Cloudberry Version: 'PostgreSQL 14.4 (Apache Cloudberry 1.0.0 build dev) on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Aug 9 2023 14:45:43' 20230823:16:02:50:003517 gpaddmirrors:mdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator... 
20230823:16:02:50:003517 gpaddmirrors:mdw:gpadmin-[INFO]:-Successfully finished pg_controldata /data0/database/primary/gpseg0 for dbid 2: stdout: pg_control version number: 13000700 diff --git a/src/pages/bootcamp/103-cbdb-performance-benchmark-tpcds.md b/src/pages/bootcamp/103-cbdb-performance-benchmark-tpcds.md index 7f9f1e96..c8189b2d 100644 --- a/src/pages/bootcamp/103-cbdb-performance-benchmark-tpcds.md +++ b/src/pages/bootcamp/103-cbdb-performance-benchmark-tpcds.md @@ -1,9 +1,9 @@ --- -title: "[103-2] TPC-DS: Decision Support Benchmark for Cloudberry Database" -description: Run the TPC-DS benchmark automatically on an existing Cloudberry Database cluster. +title: "[103-2] TPC-DS: Decision Support Benchmark for Apache Cloudberry" +description: Run the TPC-DS benchmark automatically on an existing Apache Cloudberry cluster. --- -This tool is based on the benchmark tool [Pivotal TPC-DS](https://github.com/pivotal/TPC-DS). This repo contains automation of running the DS benchmark on an existing CloudberryDB cluster. +This tool is based on the benchmark tool [Pivotal TPC-DS](https://github.com/pivotal/TPC-DS). This repo contains automation of running the DS benchmark on an existing Apache Cloudberry cluster. :::note @@ -29,9 +29,9 @@ As of version 1.2 of this tool TPC-DS 3.2.0 is used. ### Prerequisites -This is a follow-up tutorial for previous bootcamp steps. Please make sure to have the environment ready for Cloudberry Database Sandbox up and running. +This is a follow-up tutorial for previous bootcamp steps. Please make sure to have the environment ready for Apache Cloudberry Sandbox up and running. -All the following examples use the standard hostname convention of CloudberryDB using `mdw` for master node, and `sdw1..n` for the segment nodes. +All the following examples use the standard hostname convention of Cloudberry using `mdw` for master node, and `sdw1..n` for the segment nodes. 
### TPC-DS Tools Dependencies @@ -56,7 +56,7 @@ TPC-H and TPC-DS packages are already under "mdw:/tmp/" folder. ### Execution -To run the benchmark, login as `gpadmin` on `mdw` in the CloudberryDB Sandbox, and execute the following command:: +To run the benchmark, login as `gpadmin` on `mdw` in the Cloudberry Sandbox, and execute the following command:: ```bash su - gpadmin diff --git a/src/pages/bootcamp/103-cbdb-performance-benchmark-tpch.md b/src/pages/bootcamp/103-cbdb-performance-benchmark-tpch.md index 594093b6..8c1e74e4 100644 --- a/src/pages/bootcamp/103-cbdb-performance-benchmark-tpch.md +++ b/src/pages/bootcamp/103-cbdb-performance-benchmark-tpch.md @@ -1,10 +1,10 @@ --- -title: "[103-1] TPC-H: Decision Support Benchmark for Cloudberry Database" -description: Run the TPC-H benchmark automatically on an existing Cloudberry Database cluster. +title: "[103-1] TPC-H: Decision Support Benchmark for Apache Cloudberry" +description: Run the TPC-H benchmark automatically on an existing Apache Cloudberry cluster. --- This tool is based on the benchmark tool [TPC-H](https://www.tpc.org/tpch/default5.asp). -This repo will guide you on how to run the TPC-H benchmark automatically on an existing CloudberryDB cluster in the CloudberryDB Sandbox. +This repo will guide you on how to run the TPC-H benchmark automatically on an existing Apache Cloudberry cluster in the Apache Cloudberry Sandbox. :::note @@ -27,7 +27,7 @@ TPC has published the following TPC-H standards over time: ### Prerequisites -This is a follow-up tutorial for previous bootcamp steps. Please make sure to have the environment ready for Cloudberry Database Sandbox up and running. +This is a follow-up tutorial for previous bootcamp steps. Please make sure to have the environment ready for Apache Cloudberry Sandbox up and running. ### TPC-H Tools Dependencies @@ -54,7 +54,7 @@ TPC-H and TPC-DS packages are already placed under "mdw:/tmp/" folder. 
### Execution -To run the benchmark, login as `gpadmin` on `mdw` in the CloudberryDB Sandbox, and execute the following command: +To run the benchmark, login as `gpadmin` on `mdw` in the Apache Cloudberry Sandbox, and execute the following command: ```bash su - gpadmin diff --git a/src/pages/bootcamp/104-1-introduction-to-cloudberrydb-in-database-analytics.md b/src/pages/bootcamp/104-1-introduction-to-cloudberrydb-in-database-analytics.md index 74fb999b..9a189bbd 100644 --- a/src/pages/bootcamp/104-1-introduction-to-cloudberrydb-in-database-analytics.md +++ b/src/pages/bootcamp/104-1-introduction-to-cloudberrydb-in-database-analytics.md @@ -1,20 +1,20 @@ --- -title: "[104-1] Introduction to CloudberryDB In-Database Analytics" -description: Run analytics directly in the Cloudberry Database by MADlib. +title: "[104-1] Introduction to Apache Cloudberry In-Database Analytics" +description: Run analytics directly in the Apache Cloudberry by MADlib. --- -Running analytics directly in Cloudberry Database, rather than exporting data to a separate analytics engine, allows greater agility when exploring large data sets and much better performance due to parallelizing the analytic processes across all the segments. +Running analytics directly in Apache Cloudberry, rather than exporting data to a separate analytics engine, allows greater agility when exploring large data sets and much better performance due to parallelizing the analytic processes across all the segments. 
-A variety of power analytic tools is available for use with Cloudberry Database: +A variety of power analytic tools is available for use with Apache Cloudberry: * MADlib, an open-source, MPP implementation of many analytic algorithms, available at [http://madlib.apache.org/](http://madlib.apache.org/) * R statistical language * SAS, in many forms, but especially with the SAS Accelerator for Cloudberry * PMML, Predictive Modeling Markup Language -The exercises in this chapter introduce using MADlib with Cloudberry Database, using the FAA on-time data example dataset. You will examine scenarios comparing airlines and airports to learn whether there are significant relationships to be found. +The exercises in this chapter introduce using MADlib with Apache Cloudberry, using the FAA on-time data example dataset. You will examine scenarios comparing airlines and airports to learn whether there are significant relationships to be found. -In this lesson, you will use [Apache Zeppelin](https://zeppelin.apache.org/) to submit SQL statements to the Cloudberry Database. Apache Zeppelin is a web-based notebook that enables interactive data analytics. A [PostgreSQL interpreter](https://issues.apache.org/jira/browse/ZEPPELIN-250) has been added to Zeppelin, so that it can now work directly with products such as Pivotal Cloudberry Database and Pivotal HDB. +In this lesson, you will use [Apache Zeppelin](https://zeppelin.apache.org/) to submit SQL statements to the Apache Cloudberry. Apache Zeppelin is a web-based notebook that enables interactive data analytics. A [PostgreSQL interpreter](https://issues.apache.org/jira/browse/ZEPPELIN-250) has been added to Zeppelin, so that it can now work directly with products such as Pivotal Apache Cloudberry and Pivotal HDB. 
## Prepare Apache Zeppelin @@ -48,7 +48,7 @@ In this lesson, you will use [Apache Zeppelin](https://zeppelin.apache.org/) to ## Run PostgreSQL built-in aggregates -PostgreSQL has built-in aggregate functions to get standard statistics on database columns—minimum, maximum, average, and standard deviation, for example. The functions take advantage of the Cloudberry Database MPP architecture, aggregating data on the segments and then assembling results on the master. +PostgreSQL has built-in aggregate functions to get standard statistics on database columns—minimum, maximum, average, and standard deviation, for example. The functions take advantage of the Apache Cloudberry MPP architecture, aggregating data on the segments and then assembling results on the master. First, gather simple descriptive statistics on some of the data you will analyze with MADlib. The commands in this exercise are in the stats.sql script in the sample data directory. diff --git a/src/pages/bootcamp/104-2-hashml-for-datascience.md b/src/pages/bootcamp/104-2-hashml-for-datascience.md deleted file mode 100644 index 66a6f8f5..00000000 --- a/src/pages/bootcamp/104-2-hashml-for-datascience.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -title: "[104-2] HashML for Data Science" -description: More is coming ---- - -More is coming! \ No newline at end of file diff --git a/src/pages/bootcamp/cbdb-sandbox.md b/src/pages/bootcamp/cbdb-sandbox.md index 1edbfac0..39d353ff 100644 --- a/src/pages/bootcamp/cbdb-sandbox.md +++ b/src/pages/bootcamp/cbdb-sandbox.md @@ -1,9 +1,9 @@ --- -title: Install Single-Node Cloudberry Database in a Docker Container (Sandbox) -description: Learn how to quickly set up and connect to a single-node Cloudberry Database in a Docker environment. +title: Install Single-Node Apache Cloudberry in a Docker Container (Sandbox) +description: Learn how to quickly set up and connect to a single-node Apache Cloudberry in a Docker environment. 
--- -This document guides you on how to quickly set up and connect to a single-node Cloudberry Database in a Docker environment. You can try out Cloudberry Database by performing some basic operations and running SQL commands. +This document guides you on how to quickly set up and connect to a single-node Apache Cloudberry in a Docker environment. You can try out Apache Cloudberry by performing some basic operations and running SQL commands. :::warning @@ -20,16 +20,16 @@ Make sure that your environment meets the following requirements: ## Build the Sandbox -This section introduces how to set up the Docker container in which the source code of Cloudberry Database v1.5.1 (released in [Cloudberry Database Release Page](https://github.com/cloudberrydb/cloudberrydb/releases)) will be compiled. In this CentOS 7.9 Docker container, a single-node cluster will be initialized with one coordinator and two segments. Both x86 and ARM CPUs (including Apple chips) are supported. +This section introduces how to set up the Docker container in which the source code of Apache Cloudberry v1.5.1 (released in [Apache Cloudberry Release Page](https://github.com/apache/cloudberry/releases)) will be compiled. In this CentOS 7.9 Docker container, a single-node cluster will be initialized with one coordinator and two segments. Both x86 and ARM CPUs (including Apple chips) are supported. Build steps: 1. Start Docker Desktop and make sure it is running properly on your host platform. -2. Download this repository (which is [cloudberrydb/bootcamp](https://github.com/cloudberrydb/bootcamp)) to the target machine. +2. Download this repository (which is [apache/cloudberry-bootcamp](https://github.com/apache/cloudberry-bootcamp)) to the target machine. ```shell - git clone https://github.com/cloudberrydb/bootcamp.git + git clone https://github.com/apache/cloudberry-bootcamp.git ``` 3. Enter the repository and run the `run.sh` script to start the Docker container. 
This will start the automatic installation process. @@ -58,7 +58,7 @@ You can now connect to the database and try some basic operations. [root@mdw /]$ ``` -2. Log into Cloudberry Database in Docker. See the following commands and example outputs: +2. Log into Apache Cloudberry in Docker. See the following commands and example outputs: ```shell [root@mdw /] su - gpadmin # Switches to the gpadmin user. @@ -77,7 +77,7 @@ You can now connect to the database and try some basic operations. ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----- - PostgreSQL 14.4 (Cloudberry Database 1.0.0 build dev) on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Oct 24 2023 10:24:28 + PostgreSQL 14.4 (Apache Cloudberry 1.0.0 build dev) on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Oct 24 2023 10:24:28 (1 row) ``` @@ -87,5 +87,5 @@ In addition to using the `docker exec` command, you can also use the `ssh` comma ssh gpadmin@localhost # Password: cbdb@123 ``` -Now you have a Cloudberry Database and can continue with [101 Cloudberry Database Tutorials](./#2-101-cloudberrydb-tourials)! Enjoy! +Now you have a Apache Cloudberry and can continue with [101 Apache Cloudberry Tutorials](./#2-101-cloudberry-tourials)! Enjoy!
