This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 985de93f61 [blog](update) Update Job Scheduler blog (#726)
985de93f61 is described below
commit 985de93f61feec1dc12fdb3b1b152e1fb1761972
Author: KassieZ <[email protected]>
AuthorDate: Tue Jun 11 09:58:40 2024 +0700
[blog](update) Update Job Scheduler blog (#726)
---
blog/Annoucing.md | 4 +-
...pache-doris-sql-convertor-for-easy-migration.md | 2 -
blog/job-scheduler-for-task-automation.md | 211 +++++++++++++++++++++
...ti-tenant-workload-isolation-in-apache-doris.md | 2 +-
blog/release-note-2.0.11.md | 2 +-
src/components/recent-blogs/recent-blogs.data.ts | 8 +-
src/constant/newsletter.data.ts | 15 +-
static/images/auto-data-synchronization.png | Bin 0 -> 161954 bytes
.../images/job-scheduler-for-task-automation.jpg | Bin 0 -> 509193 bytes
.../images/technical-design-and-implementation.png | Bin 0 -> 297441 bytes
10 files changed, 227 insertions(+), 17 deletions(-)
diff --git a/blog/Annoucing.md b/blog/Annoucing.md
index df1b3f7ba9..04d64dbb1d 100644
--- a/blog/Annoucing.md
+++ b/blog/Annoucing.md
@@ -11,7 +11,7 @@
Apache Doris is a modern, high-performance and real-time analytical database
based on MPP. It is well known for its high-performance and easy-to-use. It can
return query results under massive data within only sub-seconds. It can support
not only high concurrent point query scenarios, but also complex analysis
scenarios with high throughput. Based on this, Apache Doris can be well applied
in many business fields, such as multi-dimensional reporting, user portrait,
ad-hoc query, real-time [...]
-Apache Doris was first born in the Palo Project within Baidu's advertising
report business and officially opened source in 2017. It was donated by Baidu
to Apache foundation for incubation in July 2018, and then incubated and
operated by members of the podling project management committee(PPMC)under the
guidance of Apache incubator mentors.
+Apache Doris was first born in the Palo Project within Baidu's advertising
report business and officially opened source in 2017. It was donated by Baidu
to Apache foundation for incubation in July 2018, and then incubated and
operated by members of the podling project management committee (PPMC) under
the guidance of Apache incubator mentors.
We are very proud that Doris graduated from Apache incubator successfully. It
is an important milestone. In the whole incubating period, with the guidance of
Apache Way and the help of incubator mentors, we learned how to develop our
project and community in Apache Way, and have achieved great growth in this
process.
@@ -39,7 +39,7 @@ Apache Doris will carry out more challenging and meaningful
work in the future,
Once again, we sincerely thank all contributors who participated in the
construction of Apache Doris community and all users who use Apache Doris and
constantly put forward improvement suggestions. At the same time, we also thank
our incubator mentors, IPMC members and friends in various open source project
communities who have continuously encouraged, supported and helped us all the
way.
-**Apache Doris GitHub:**
+**Apache Doris GitHub:**
[https://github.com/apache/doris](https://github.com/apache/doris)
diff --git
a/blog/from-presto-trino-clickhouse-and-hive-to-apache-doris-sql-convertor-for-easy-migration.md
b/blog/from-presto-trino-clickhouse-and-hive-to-apache-doris-sql-convertor-for-easy-migration.md
index d520a34b60..4efec23362 100644
---
a/blog/from-presto-trino-clickhouse-and-hive-to-apache-doris-sql-convertor-for-easy-migration.md
+++
b/blog/from-presto-trino-clickhouse-and-hive-to-apache-doris-sql-convertor-for-easy-migration.md
@@ -4,8 +4,6 @@
'summary': "Users can execute queries with their old SQL syntaxes directly
in Doris or batch convert their existing SQL statements on the visual SQL
conversion interface.",
'date': '2024-05-06',
'author': 'Apache Doris',
- 'picked': "true",
- 'order': "4",
'tags': ['Tech Sharing'],
"image": '/images/sql-convertor-feature.jpeg'
}
diff --git a/blog/job-scheduler-for-task-automation.md
b/blog/job-scheduler-for-task-automation.md
new file mode 100644
index 0000000000..df8d403f78
--- /dev/null
+++ b/blog/job-scheduler-for-task-automation.md
@@ -0,0 +1,211 @@
+---
+{
+ 'title': 'Another lifesaver for data engineers: Apache Doris Job Scheduler
for task automation',
+ 'summary': "The built-in Doris Job Scheduler triggers pre-defined
operations efficiently and reliably. It is useful in many cases including ETL
and data lake analytics.",
+ 'date': '2024-06-06',
+ 'author': 'Apache Doris',
+ 'tags': ['Best Practice'],
+ 'picked': "true",
+ 'order': "1",
+ "image": '/images/job-scheduler-for-task-automation.jpg'
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Job scheduling is an important part of data management as it enables regular
data updates and cleanups. In a data platform, it is often undertaken by
workflow orchestration tools like [Apache Airflow](https://airflow.apache.org)
and [Apache DolphinScheduler](https://dolphinscheduler.apache.org/en-us).
However, adding another component to the data architecture also means investing
extra resources for management and maintenance. That's why [Apache Doris
2.1.0](https://doris.apache.org/blog [...]
+
+The Doris Job Scheduler triggers the pre-defined operations at specific time
points or intervals, thus allowing for efficient and reliable task automation.
Its key capabilities include:
+
+- **Efficiency**: It adopts the TimingWheel algorithm to ensure that the
triggering of tasks is precise to the second.
+
+- **Flexibility**: It supports both one-time jobs and regular jobs. For the
latter, users can define the start/end time, and intervals of minutes, hours,
days, or weeks.
+
+- **Execution thread pool and processing queue**: It is supported by a
Disruptor-based single-producer, multi-consumer model to avoid task execution
overload.
+
+- **Traceability**: It keeps track of the latest task execution records
(configurable), which are queryable by a simple command.
+
+- **Availability**: Like Apache Doris itself, the Doris Job Scheduler is
easily recoverable and highly available.
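+
+As a minimal sketch of the traceability point above, the statements below assume the `jobs` and `tasks` table-valued functions that Doris 2.1 provides for insert jobs. Check the documentation of your version for the exact function names and returned columns.
+
+```sql
+-- List the insert jobs that have been created (assumes the jobs() table-valued function).
+SELECT * FROM jobs("type"="insert");
+
+-- List the retained task execution records of those jobs (assumes the tasks() table-valued function).
+SELECT * FROM tasks("type"="insert");
+```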
+
+## Syntax & examples
+
+### Syntax description
+
+A valid job statement consists of the following elements:
+
+- `CREATE JOB`: Specifies the job name as a unique identifier.
+
+- The `ON SCHEDULE` clause: Specifies the type, trigger time, and frequency of
the job.
+
+ - `AT timestamp`: This is used to specify a one-time job. `AT
CURRENT_TIMESTAMP` means that the job will run immediately upon creation.
+
+ - `EVERY`: This is used to specify a regular job. You can define the
execution frequency of the job. The interval can be measured in weeks, days,
hours, and minutes.
+
+ - The `EVERY` clause supports an optional `STARTS` clause with a
timestamp to define the start time of the recurring schedule.
`CURRENT_TIMESTAMP` can be used. It also supports an optional `ENDS` clause to
specify the end time for the job.
+
+- The `DO` clause defines the action to be performed when the job is executed.
At this time, the only supported operation is INSERT.
+
+ ```sql
+ CREATE
+ JOB
+ job_name
+ ON SCHEDULE schedule
+ [COMMENT 'string']
+ DO execute_sql;
+
+ schedule: {
+ AT timestamp
+ | EVERY interval
+ [STARTS timestamp ]
+ [ENDS timestamp ]
+ }
+
+ interval:
+ quantity { WEEK | DAY | HOUR | MINUTE }
+
+ ```
+
+ Example:
+
+ ```sql
+ CREATE JOB my_job ON SCHEDULE EVERY 1 MINUTE DO INSERT INTO db1.tbl1 SELECT
* FROM db2.tbl2;
+ ```
+
+ The above statement creates a job named `my_job`, which is to load data from
`db2.tbl2` to `db1.tbl1` every minute.
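+
+ As a further sketch of the same grammar, the statement below adds the optional `COMMENT` clause and uses an hourly interval. The job name `hourly_sync_job` and the comment text are hypothetical.
+
+ ```sql
+ -- Hypothetical job name and comment; the tables are the ones used in the example above.
+ CREATE JOB hourly_sync_job
+ ON SCHEDULE EVERY 1 HOUR
+ COMMENT 'Load db2.tbl2 into db1.tbl1 every hour'
+ DO INSERT INTO db1.tbl1 SELECT * FROM db2.tbl2;
+ ```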
+
+### More examples
+
+**Create a one-time job**: Load data from `db2.tbl2` to `db1.tbl1` at
2025-01-01 00:00:00.
+
+```sql
+CREATE JOB my_job ON SCHEDULE AT '2025-01-01 00:00:00' DO INSERT INTO db1.tbl1
SELECT * FROM db2.tbl2;
+```
+
+**Create a regular job without specifying the end time**: Load data from
`db2.tbl2` to `db1.tbl1` once a day starting from 2025-01-01 00:00:00.
+
+```sql
+CREATE JOB my_job ON SCHEDULE EVERY 1 DAY STARTS '2025-01-01 00:00:00' DO
INSERT INTO db1.tbl1 SELECT * FROM db2.tbl2 WHERE create_time >=
days_add(now(),-1);
+```
+
+**Create a regular job within a specified period**: Load data from `db2.tbl2`
to `db1.tbl1` once a day, beginning at 2025-01-01 00:00:00 and finishing at
2026-01-01 00:10:00.
+
+```sql
+CREATE JOB my_job ON SCHEDULE EVERY 1 DAY STARTS '2025-01-01 00:00:00' ENDS
'2026-01-01 00:10:00' DO INSERT INTO db1.tbl1 SELECT * FROM db2.tbl2 WHERE
create_time >= days_add(now(),-1);
+```
+
+**Asynchronous execution**: Jobs are executed asynchronously in Doris, so
tasks that require asynchronous execution, such as `insert into select`, can be
implemented as a job.
+
+For example, to asynchronously execute data loading from `db2.tbl2` to
`db1.tbl1`, simply create a one-time job for it and schedule it at
`current_timestamp`.
+
+```sql
+CREATE JOB my_job ON SCHEDULE AT current_timestamp DO INSERT INTO db1.tbl1
SELECT * FROM db2.tbl2;
+```
+
+## Auto data synchronization
+
+The combination of the Job Scheduler and the
[Multi-Catalog](https://doris.apache.org/docs/lakehouse/lakehouse-overview#multi-catalog)
feature of Apache Doris is an efficient way to implement regular data
synchronization across data sources.
+
+This is useful in many cases, such as for an e-commerce user who regularly
needs to load business data from MySQL to Doris for analysis.
+
+**Example**: Filter consumers by total consumption amount, last visit time,
sex, and city in the table below, and regularly import the query results into
Doris.
+
+
+
+**Step 1**: Create a table in Doris
+
+```sql
+CREATE TABLE IF NOT EXISTS user_activity
+(
+ `user_id` LARGEINT NOT NULL COMMENT "User ID",
+ `date` DATE NOT NULL COMMENT "Time of data import",
+ `city` VARCHAR(20) COMMENT "User city",
+ `age` SMALLINT COMMENT "User age",
+ `sex` TINYINT COMMENT "User sex",
+ `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT
"Time of user's last visit",
+ `cost` BIGINT SUM DEFAULT "0" COMMENT "User's total consumption amount",
+ `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum dwell time of user",
+ `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwell time of
user"
+)
+AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
+DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 1"
+);
+```
+
+**Step 2**: Create a catalog in Doris to map to the data in MySQL
+
+```sql
+CREATE CATALOG activity PROPERTIES (
+ "type"="jdbc",
+ "user"="root",
+ "jdbc_url" = "jdbc:mysql://127.0.0.1:9734/user?useSSL=false",
+ "driver_url" = "mysql-connector-java-5.1.49.jar",
+ "driver_class" = "com.mysql.jdbc.Driver"
+);
+```
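+
+Before scheduling any job, it can help to confirm that the catalog is reachable. A minimal check, assuming the standard `SHOW CATALOGS` statement and the `catalog.database.table` naming used in Step 3 (the `LIMIT` value is arbitrary):
+
+```sql
+-- Confirm that the `activity` catalog is registered.
+SHOW CATALOGS;
+
+-- Preview a few rows of the MySQL table through the catalog.
+SELECT * FROM activity.user.activity LIMIT 10;
+```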
+
+**Step 3**: Ingest data from MySQL to Doris. Leverage the catalog mechanism
and the Insert Into method for full data ingestion. (We recommend that such
operations be executed during low-traffic hours to minimize potential service
disruptions.)
+
+- **One-time job**: Schedule a one-time full-scale data loading that starts at
2024-8-10 03:00:00.
+
+ ```sql
+ CREATE JOB one_time_load_job
+ ON SCHEDULE
+ AT '2024-8-10 03:00:00'
+ DO
+ INSERT INTO user_activity SELECT * FROM activity.user.activity;
+
+ ```
+
+- **Regular job**: Create a regular job to update data periodically.
+
+ ```sql
+ CREATE JOB schedule_load
+ ON SCHEDULE EVERY 1 DAY
+ DO
+ INSERT INTO user_activity SELECT * FROM activity.user.activity WHERE
create_time >= days_add(now(),-1);
+ ```
+
+## Technical design & implementation
+
+Efficient scheduling often comes at the cost of significant resource
consumption, and high-precision scheduling is even more resource-intensive. To
implement job scheduling, some people rely on the built-in scheduling
capabilities of Java, while others employ job scheduling libraries. But what if
we want higher precision and lower memory usage than these solutions can reach?
For that, the Doris developers combine the TimingWheel algorithm with the Disruptor
framework to achieve second-level [...]
+
+
+
+To implement the TimingWheel algorithm, we leverage the HashedWheelTimer in
Netty. Every 10 minutes (by default), the Job Manager puts tasks into the
TimingWheel for scheduling. To ensure efficient task triggering and
avoid high resource usage, we adopt a Disruptor-based single-producer,
multi-consumer model. The TimingWheel only triggers tasks but does not execute
jobs directly. Tasks that need to be triggered upon expiration will be put into
a Dispatch thread and distributed to an ap [...]
+
+This is how we improve processing efficiency by reducing unnecessary
traversal: one-time tasks have their definitions removed after execution, while
for recurring tasks, the system events in the TimingWheel periodically fetch the
next round of tasks to execute. This helps to avoid the accumulation of tasks
in a single bucket.
+
+In addition, for transactional tasks, the Job Scheduler can ensure data
consistency and integrity through its transaction association and transaction
callback mechanisms.
+
+## Applicable scenarios
+
+The Doris Job Scheduler is a Swiss Army Knife. It is not only useful in ETL
and data lake analytics as we mentioned, but also critical for the
implementation of [asynchronous materialized
views](https://doris.apache.org/docs/query/view-materialized-view/async-materialized-view).
An asynchronous materialized view is a pre-computed result set. Unlike normal
materialized views, it can be built on multiple tables. Thus, as you can
imagine, changes in any of the source tables will lead to the [...]
+
+Where are we going with the Doris Job Scheduler? The [Apache Doris developer
community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2gmq5o30h-455W226d79zP3L96ZhXIoQ)
is looking at:
+
+- Displaying the distribution of tasks executed in different time slots on the
WebUI.
+
+- DAG jobs. This will allow data warehouse task orchestration within Apache
Doris, which will unlock many possibilities when it is combined with the
Multi-Catalog feature.
+
+- Support for more operations such as UPDATE and DELETE.
\ No newline at end of file
diff --git a/blog/multi-tenant-workload-isolation-in-apache-doris.md
b/blog/multi-tenant-workload-isolation-in-apache-doris.md
index 0ee91de040..4cfe9df002 100644
--- a/blog/multi-tenant-workload-isolation-in-apache-doris.md
+++ b/blog/multi-tenant-workload-isolation-in-apache-doris.md
@@ -5,7 +5,7 @@
'date': '2024-05-14',
'author': 'Apache Doris',
'picked': "true",
- 'order': "3",
+ 'order': "4",
'tags': ['Tech Sharing'],
"image": '/images/multi-tenant-workload-group.jpg'
}
diff --git a/blog/release-note-2.0.11.md b/blog/release-note-2.0.11.md
index cee29a425b..e4899dd5be 100644
--- a/blog/release-note-2.0.11.md
+++ b/blog/release-note-2.0.11.md
@@ -6,7 +6,7 @@
'author': 'Apache Doris',
'tags': ['Release Notes'],
'picked': "true",
- 'order': "1",
+ 'order': "2",
"image": '/images/2.0.11.jpg'
}
---
diff --git a/src/components/recent-blogs/recent-blogs.data.ts
b/src/components/recent-blogs/recent-blogs.data.ts
index 9a23d248f1..dca34b12b6 100644
--- a/src/components/recent-blogs/recent-blogs.data.ts
+++ b/src/components/recent-blogs/recent-blogs.data.ts
@@ -1,11 +1,11 @@
export const RECENT_BLOGS_POSTS = [
{
- label: 'Apache Doris for log and time series data analysis in NetEase,
why not Elasticsearch and InfluxDB?',
- link:
'https://doris.apache.org/blog/apache-doris-for-log-and-time-series-data-analysis-in-netease',
+ label: 'Apache Doris version 2.0.11 has been released',
+ link: 'https://doris.apache.org/blog/release-note-2.0.11',
},
{
- label: 'Apache Doris 2.1.3 just released',
- link: 'https://doris.apache.org/blog/release-note-2.1.3',
+ label: 'Apache Doris for log and time series data analysis in NetEase,
why not Elasticsearch and InfluxDB?',
+ link:
'https://doris.apache.org/blog/apache-doris-for-log-and-time-series-data-analysis-in-netease',
},
{
label: 'Multi-tenant workload isolation: a better balance between
isolation and utilization',
diff --git a/src/constant/newsletter.data.ts b/src/constant/newsletter.data.ts
index 74e28a19e8..d306fa140e 100644
--- a/src/constant/newsletter.data.ts
+++ b/src/constant/newsletter.data.ts
@@ -1,4 +1,11 @@
export const NEWSLETTER_DATA = [
+ {
+ tags: ['Best Practice'],
+ title: "Another lifesaver for data engineers: Apache Doris Job
Scheduler for task automation",
+ content: `The built-in Doris Job Scheduler triggers pre-defined
operations efficiently and reliably. It is useful in many cases including ETL
and data lake analytics.`,
+ to: '/blog/job-scheduler-for-task-automation',
+ image: 'job-scheduler-for-task-automation.jpg',
+ },
{
tags: ['Release Note'],
title: "Apache Doris version 2.0.11 just released",
@@ -20,11 +27,5 @@ export const NEWSLETTER_DATA = [
to: '/blog/multi-tenant-workload-isolation-in-apache-doris',
image: 'multi-tenant-workload-group.jpg',
},
- {
- tags: ['Tech Sharing'],
- title: "From Presto, Trino, ClickHouse, and Hive to Apache Doris: SQL
convertor for easy migration",
- content: `Users can execute queries with their old SQL syntaxes
directly in Doris or batch convert their existing SQL statements on the visual
SQL conversion interface.`,
- to:
'/blog/from-presto-trino-clickhouse-and-hive-to-apache-doris-sql-convertor-for-easy-migration',
- image: 'sql-convertor-feature.jpeg',
- },
+
];
diff --git a/static/images/auto-data-synchronization.png
b/static/images/auto-data-synchronization.png
new file mode 100644
index 0000000000..96b75db655
Binary files /dev/null and b/static/images/auto-data-synchronization.png differ
diff --git a/static/images/job-scheduler-for-task-automation.jpg
b/static/images/job-scheduler-for-task-automation.jpg
new file mode 100644
index 0000000000..566faa0304
Binary files /dev/null and
b/static/images/job-scheduler-for-task-automation.jpg differ
diff --git a/static/images/technical-design-and-implementation.png
b/static/images/technical-design-and-implementation.png
new file mode 100644
index 0000000000..bf0bb66d40
Binary files /dev/null and
b/static/images/technical-design-and-implementation.png differ