This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 223ed386200 Add summit 2023 blog (#338)
223ed386200 is described below
commit 223ed3862004be543acc3ca9fee8878d45058795
Author: Hu Yanjun <[email protected]>
AuthorDate: Mon Nov 13 16:50:20 2023 +0800
Add summit 2023 blog (#338)
---
...expect-from-apache-doris-as-a-data-warehouse.md | 176 +++++++++++++++++++++
.../Apache-Doris-monthly-active-contributors.png | Bin 0 -> 241310 bytes
.../apache-doris-data-warehouse-layers.png | Bin 0 -> 176498 bytes
.../images/summit2023/data-writing-efficiency.png | Bin 0 -> 96030 bytes
static/images/summit2023/doris-manager.png | Bin 0 -> 192041 bytes
.../summit2023/hybrid-column-row-storage.png | Bin 0 -> 118081 bytes
.../kubernetes-operator-for-apache-doris.png | Bin 0 -> 143614 bytes
static/images/summit2023/merge-on-write.png | Bin 0 -> 187180 bytes
.../resource-isolation-workload-group.png | Bin 0 -> 170387 bytes
.../summit2023/schemaless-variant-data-type.png | Bin 0 -> 83456 bytes
.../summit2023/storage-compute-separation.png | Bin 0 -> 308766 bytes
static/images/summit2023/tiered-storage.png | Bin 0 -> 176558 bytes
static/images/summit2023/writing-throughput.png | Bin 0 -> 100815 bytes
13 files changed, 176 insertions(+)
diff --git
a/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse.md
b/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse.md
new file mode 100644
index 00000000000..47b345c8146
--- /dev/null
+++
b/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse.md
@@ -0,0 +1,176 @@
+---
+{
+ 'title': 'Apache Doris Summit Asia 2023: What Can You Expect From Apache
Doris as a Data Warehouse?',
+ 'summary': "The past year marks a breakthrough of Apache Doris, an
open-source real-time data warehouse that has just undergone an overall upgrade
after long consistent incremental optimizations.",
+ 'date': '2023-11-10',
+ 'author': 'Apache Doris',
+ 'tags': ['Top News'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+When it is cranberry and pumpkin season, we had the unforgettable Apache Doris
Summit Asia 2023 with our remarkable committers, users, and community partners,
to honor what we have achieved in the past year, and provide a preview of where
we are going next.
+
+The past year marks a breakthrough of [Apache
Doris](https://doris.apache.org/), an open-source real-time data warehouse that
has just undergone an overall upgrade after long consistent incremental
optimizations:
+
+**More**
+
+Thanks to the hard work of 275 committers, the [Apache Doris
2.0](https://doris.apache.org/blog/release-note-2.0.0) milestone has merged
over 4100 pull requests, representing a 70% increase from version 1.2 last year
and a 10-fold increase from 1.1.
+
+**Faster**
+
+This year, Apache Doris has attained a 10-fold performance increase in blind
benchmarking and single-table queries, a 13-fold increase in multi-table joins,
and a 20-fold increase in concurrent point queries. The high query performance
is supported by the smart design of Apache Doris, including a vectorized
execution engine, Merge-on-Write mechanism, the Light Schema Change feature, a
self-adaptive parallel execution model, and a [new query
optimizer](https://doris.apache.org/docs/query- [...]
+
+**Wider**
+
+We have built Apache Doris into more than just a powerful OLAP engine but also
a data warehouse for a wider range of use cases, including log analysis and
high-concurrency data services. To expand the data warehousing capabilities of
Apache Doris, we have introduced
[Multi-Catalog](https://doris.apache.org/docs/lakehouse/multi-catalog/) to
connect Doris to a wide array of data sources.
+
+## One of the most active open source big data projects
+
+Apache Doris has become one of the world's most active open-source big data
projects in all aspects:
+
+- It has hit **10K stars** on [GitHub](https://github.com/apache/doris/), a
year-on-year growth of 70%, and the momentum keeps going.
+- The community has included almost 600 contributors and welcomes new faces
every week.
+- With **120 monthly active contributors**, Apache Doris has become a more
active project than Apache Spark, Elasticsearch, Trino, and Apache Druid.
+- Over **160 pull requests** are created every week. Meanwhile, we have
established a mature code review pipeline, making sure that every pull request
stands the test of 3000 use cases. This is how we guarantee stability in the
midst of agile iteration.
+
+
+
+Along with such growth, we've also witnessed higher diversity among
contributors. They are engineers from tech giants and database unicorns, like
VeloDB, which is the commercial company based on Apache Doris. Many cloud
service providers, including Alibaba Cloud, Tencent Cloud, Huawei Cloud, AWS
and GCP (coming soon), have also jumped on the bandwagon and provided
Doris-based data warehouse cloud hosting services.
+
+## Fast-expanding user base
+
+Apache Doris now has a user base of over 30,000 data engineers from more than
**4000 enterprises**, including those from the tech sector, finance, telecom,
manufacturing, logistics, and retail. The great majority of them keep in close
touch with the Apache Doris developers, committing code, getting involved in
tests, and sharing experience and feedback with the community.
+
+## Fruit that have been reaped
+
+We aim to make Apache Doris the first choice for people in real-time data
analysis. What we have done in the past year can be concluded in three keywords:
+
+- **Real-time**: We have realized high-throughput real-time data writing and
updates, as well as low query latency.
+- **Unified**: As we've been trying to make Doris an all-in-one platform that
can undertake most of the analytic workloads for users, we have expanded and
enhanced the data lakehousing capabilities of Doris, enabled faster log
analysis, faster ELT/ETL, and faster response to point queries.
+- **Cloud-native**: This is a leap towards cloud infrastructure. Apache Doris
can now be deployed and run on Kubernetes to reduce storage and computation
costs.
+
+### Real-time response to queries
+
+As is said, Apache Doris 2.0 delivers 10 times faster query speed than the
previous versions, but what is the key accelerator behind such high
performance? It is the [cost-based query
optimizer](https://doris.apache.org/docs/query-acceleration/nereids/) and the
self-adaptive [pipeline parallel execution
model](https://doris.apache.org/docs/query-acceleration/pipeline-execution-engine/)
of Apache Doris.
+
+In traditional data reporting, data is often arranged in flat tables. The idea
of flat tables and pre-aggregated tables is to trade storage space for query
speed. In these cases, the key to high performance is to accelerate data
scanning and aggregation. However, since nowadays data analytic workloads
involve more complex computations with more and larger batch processing, data
engineers often have to fine-tune the database and rewrite the SQL before they
can enjoy satisfactory query spe [...]
+
+Similarly, the new version of Doris has automated another
engineering-intensive process: adjusting the compute instance execution
concurrency in the backend. What bothered our users was that when queries of
different sizes happened concurrently, these queries tended to fight for
resources and thus required human intervention. To solve that, we have
introduced a pipeline execution model. It automatically decides the execution
concurrency for the current situation to make sure queries of a [...]
+
+For **[high concurrency point
queries](https://doris.apache.org/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times)**,
Apache Doris 2.0 reached a throughput of 30,000 QPS. It is a 20-fold
improvement driven by optimizations in data storage, reading, and query
execution. As a column-oriented DBMS, Apache Doris has relatively low row
reading efficiency, so we have introduced ow/column hybrid storage and [row
cache](https://doris.apache.org/docs/query-acceleration/hight-concurrent [...]
+
+
+
+For **multi-dimensional data analysis**, we introduced [inverted
index](https://doris.apache.org/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch)
to accelerate fuzzy keyword queries, equivalence queries, and range queries.
+
+### Real-time data writing and update
+
+Data writing is another side of the real-time story, so we also spent great
efforts improving the data ingestion speed of Apache Doris. After optimizations
like Memtable parallel flushing and single-copy ingestion, Apache Doris is now
2~8 times faster in data writing.
+
+
+
+The
**[Merge-on-Write](https://doris.apache.org/docs/data-table/data-model#merge-on-write)**
mechanism has been upgraded in version 2.0. It enables an upsert throughput of
nearly 1 million rows per second, and it now supports a wider range of updating
operations, including partial column updates.
+
+
+
+### Support for more use cases
+
+For **[data
lakehousing](https://doris.apache.org/blog/Building-the-Next-Generation-Data-Lakehouse-10X-Performance)**,
our last big move was to introduce
[Multi-Catalog](https://doris.apache.org/docs/lakehouse/multi-catalog/) for
auto-mapping and auto-synchronization of heterogeneous data sources. In 2.0, we
have further enhanced that. It now supports even more data sources, and it is
also much faster in various production environments. With multi-catalog, users
can ingest their multi-so [...]
+
+For **[log
analysis](https://doris.apache.org/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch)**,
Doris 2.0 provides native support for semi-structured data, which can be
arranged in data types like Json, Array, and Map. On the basis of Light Schema
Change, it allows Schema Evolution. In addition to the foregoing inverted
index, Doris 2.0 comes with a high-performance text analysis algorithm. Built
on its large-size data writing and low-cost storage [...]
+
+For different analytic workloads in one single cluster, the Doris solution to
**resource isolation** is [Workload
Group](https://doris.apache.org/docs/admin-manual/workload-group). As the name
implies, it is to divide various workloads into groups and thus allow more
flexible use of memory and CPU resources. Users can limit the number of queries
that a workload group can handle concurrently, so when there are too many query
requests, the excessive ones will wait in a queue. This is a way [...]
+
+
+
+### Low cost and high availability
+
+Apache Doris provides **[tiered
storage](https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How)**.
The less frequently accessed data, namely, cold data, will be put into object
storage to reduce costs. Moreover, since object storage only requires a single
copy of data, the storage costs will be further cut by 2/3 compared to
3-replica storage. Calculation based on AWS pricing shows that tiered storage
can save you 70% of your cloud disk expenditure.
+
+
+
+To facilitate Kubernetes deployment, we have built a **Kubernetes Operator**.
With it, users can easily deploy, scale, inspect, and maintain all Apache Doris
nodes (frontends, backends, compute nodes, brokers) on Kubernetes. Compute node
is a variant of backend nodes but it does not store any data, which is why it
is a good fit for auto-scaling of clusters. During computation peaks, compute
nodes can flexibly join the cluster and share the burden. Auto-scaling has been
under active testi [...]
+
+
+
+For service availability guarantee, Apache Doris 2.0 supports **Cross-Cluster
Replication (CCR)**. As a disaster recovery solution, it supports read-write
separation and multi-data center backup.
+
+## Reach for the stars
+
+In the foreseeable future, Apache Doris will go further on the aforementioned
three directions: real-time, unified, and cloud-native.
+
+### Get even faster
+
+In the upcoming Apache Doris 2.1, the **cost-based query optimizer (CBO)**
will be able to automatically collect execution statistics and provide support
for hint syntax. It will also allow users to adjust the optimizing rules. To
fully demonstrate the performance of our CBO, we will release a TPC-DS
benchmark results.
+
+In addition, Doris 2.1 will support **multi-table materialized views** and
**writing intermediate results to disks**. Meanwhile, a Union All operator will
be added to accelerate the ETL process in Apache Doris. That means users will
experience higher performance and stability when processing large batches of
data. You can also expect a new Join algorithm that can double the execution
speed of multi-table join queries.
+
+In terms of **data writing**, we try to make it simpler and more intuitive for
you, and efforts will be made in three aspects.
+
+1. In future versions, data streams, local files, and those from relational
databases or data lakes will all be put into relational tables, and they can
all be written into Doris using the simple `insert into` statement.
+2. We will simplify the data writing pipeline. Data writing will be
implemented by the built-in job scheduling mechanism, so users won't need an
extra data synchronization component.
+3. When there is frequent data writing, Doris will wait until the data
accumulates into a sizable batch at the server end, so as to reduce the
pressure caused by small file merging.
+
+In terms of **data updating**, as the Merge-on-Write mechanism advances
towards maturity, it will be enabled in Doris by default. Users will be able to
flexibly update or modify any columns in tables as they want. Also, based on
Merge-on-Write, we will build a one-size-fits-all data model, so users don't
have to rack their brains choosing the right data model for various use cases.
+
+Apache Doris 2.1 will have enhanced **observability**. It will provide a brand
new Profile for users to monitor operator execution, and visualize the query
execution status with the aid of [Doris
Manager](https://github.com/apache/doris-manager).
+
+
+
+### Support more analytic scenarios
+
+The above-mentioned multi-table materialized view and built-in job scheduling
mechanism will also benefit the **data lakehousing** capability of Doris. From
heterogeneous data sources to the data warehouse, users won't need a second
component to do ETL and data warehouse layering.
+
+In version 2.0, we support data writeback to JDBC sources, and we are going to
expand that functionality to more data sources, including Apache Iceberg,
Apache Hudi, Delta Lake and Apache Paimon.
+
+
+
+For data ingestion from data lakes, Apache Doris currently adopts the MySQL
protocol. In large-scale data reading or data science use cases (like those
involving Pandas), this might be a throughput bottleneck. Thus, what we are
doing is introducing an Arrow Flight-based high-speed reading interface, which
transfers data via the Doris backends directly. **In our tests, the new
interface delivers a writing throughput that is 100 times higher.**
+
+
+
+For **log analysis**, the inverted index will support more complicated data
types, such as Array, Map, and GEO. We will also introduce a new data type
named Variant to provide **schema-free support**. This means users can not only
put Json data of any shapes and types in the table fields, but also easily
handle schema changes without any DDL operations.
+
+
+
+For **workload management**, we will enable higher flexibility. Users will be
able to use SQL to create, manage, and allocate resources for their Workload
Groups. We will continue to maximize resource utilization while ensuring
resource isolation between workload groups.
+
+### Cloud-nativeness and storage-compute separation
+
+When Apache Doris 2.0 was released, we previewed the merging of the SelectDB
Cloud storage-compute separation solution into the Apache Doris project. After
some intense code refactoring and compatibility building, this functionallity
will be good and ready in Apache Doris 2.2, and users will be able to
experience the elastic computation capability.
+
+
+
+## Stick to Innovation
+
+As Apache Doris is on the ramp, we look back on its ten-year development and
ask ourselves: **what injects vitality to this great project and keep it
vibrant for this long?** The answer is, we have been working with innovators.
+
+Back in the time when SQL on Hadoop gained currency, Apache Doris chose to
stay outside the Hadoop ecosystem. It does not rely on HDFS for data storage,
nor Zookeeper for distributed monitoring, but insists on providing high
availability by its scalable processes. When the major databases on the market
goes by their own syntaxes, Apache Doris adopts stand SQL and the MySQL
protocol, in order to lower the threshold for users.
+
+From the self-developed pre-aggregation storage engine, materialized views,
and the MPP framework, to inverted index, row/column hybrid storage, Light
Schema Change, Merge-on-Write, and the Variant data type, Apache Doris never
stops breaking new ground to provide better performance and user experience,
which is also what we are going to do next:
+
+- We want to work with more open-source enthusiasts to make a difference to
the world.
+- We want to keep inspiring the data world by presenting more use cases.
+- We want to provide more and better choices for users by collaborating with
partners along the data pipeline and cloud service providers.
+
+By choosing Apache Doris, you choose to stay in the heartbeat of innovation.
The [Apache Doris
community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog)
awaits newcomers.
diff --git
a/static/images/summit2023/Apache-Doris-monthly-active-contributors.png
b/static/images/summit2023/Apache-Doris-monthly-active-contributors.png
new file mode 100644
index 00000000000..85250380290
Binary files /dev/null and
b/static/images/summit2023/Apache-Doris-monthly-active-contributors.png differ
diff --git a/static/images/summit2023/apache-doris-data-warehouse-layers.png
b/static/images/summit2023/apache-doris-data-warehouse-layers.png
new file mode 100644
index 00000000000..f052e95edc5
Binary files /dev/null and
b/static/images/summit2023/apache-doris-data-warehouse-layers.png differ
diff --git a/static/images/summit2023/data-writing-efficiency.png
b/static/images/summit2023/data-writing-efficiency.png
new file mode 100644
index 00000000000..5132f8d285d
Binary files /dev/null and
b/static/images/summit2023/data-writing-efficiency.png differ
diff --git a/static/images/summit2023/doris-manager.png
b/static/images/summit2023/doris-manager.png
new file mode 100644
index 00000000000..eaabe00af2f
Binary files /dev/null and b/static/images/summit2023/doris-manager.png differ
diff --git a/static/images/summit2023/hybrid-column-row-storage.png
b/static/images/summit2023/hybrid-column-row-storage.png
new file mode 100644
index 00000000000..1f508c48aa4
Binary files /dev/null and
b/static/images/summit2023/hybrid-column-row-storage.png differ
diff --git a/static/images/summit2023/kubernetes-operator-for-apache-doris.png
b/static/images/summit2023/kubernetes-operator-for-apache-doris.png
new file mode 100644
index 00000000000..93b63ba5d1d
Binary files /dev/null and
b/static/images/summit2023/kubernetes-operator-for-apache-doris.png differ
diff --git a/static/images/summit2023/merge-on-write.png
b/static/images/summit2023/merge-on-write.png
new file mode 100644
index 00000000000..15333139e97
Binary files /dev/null and b/static/images/summit2023/merge-on-write.png differ
diff --git a/static/images/summit2023/resource-isolation-workload-group.png
b/static/images/summit2023/resource-isolation-workload-group.png
new file mode 100644
index 00000000000..06c36c60a01
Binary files /dev/null and
b/static/images/summit2023/resource-isolation-workload-group.png differ
diff --git a/static/images/summit2023/schemaless-variant-data-type.png
b/static/images/summit2023/schemaless-variant-data-type.png
new file mode 100644
index 00000000000..e47bd315a5c
Binary files /dev/null and
b/static/images/summit2023/schemaless-variant-data-type.png differ
diff --git a/static/images/summit2023/storage-compute-separation.png
b/static/images/summit2023/storage-compute-separation.png
new file mode 100644
index 00000000000..3deaf3dcbde
Binary files /dev/null and
b/static/images/summit2023/storage-compute-separation.png differ
diff --git a/static/images/summit2023/tiered-storage.png
b/static/images/summit2023/tiered-storage.png
new file mode 100644
index 00000000000..5dd3d38a28c
Binary files /dev/null and b/static/images/summit2023/tiered-storage.png differ
diff --git a/static/images/summit2023/writing-throughput.png
b/static/images/summit2023/writing-throughput.png
new file mode 100644
index 00000000000..63894817472
Binary files /dev/null and b/static/images/summit2023/writing-throughput.png
differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]