sthetland commented on a change in pull request #11051:
URL: https://github.com/apache/druid/pull/11051#discussion_r604412994
##########
File path: docs/design/index.md
##########
@@ -22,79 +22,74 @@ title: "Introduction to Apache Druid"
~ under the License.
-->
-## What is Druid?
+Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics
("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries)
on large data sets. Most often, Druid powers use cases where real-time
ingestion, fast query performance, and high uptime are important.
-Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics
-("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries)
on large data sets. Druid is most often
-used as a database for powering use cases where real-time ingest, fast query
performance, and high uptime are important.
-As such, Druid is commonly used for powering GUIs of analytical applications,
or as a backend for highly-concurrent APIs
-that need fast aggregations. Druid works best with event-oriented data.
+Druid is commonly used as the database backend for GUIs of analytical
applications, or for highly-concurrent APIs that need fast aggregations. Druid
works best with event-oriented data.
Common application areas for Druid include:
-- Clickstream analytics (web and mobile analytics)
-- Network telemetry analytics (network performance monitoring)
+- Clickstream analytics including web and mobile analytics
+- Network telemetry analytics including network performance monitoring
- Server metrics storage
-- Supply chain analytics (manufacturing metrics)
+- Supply chain analytics including manufacturing metrics
- Application performance metrics
- Digital marketing/advertising analytics
-- Business intelligence / OLAP
+- Business intelligence/OLAP
+
+## Key features of Druid
Druid's core architecture combines ideas from data warehouses, timeseries
databases, and logsearch systems. Some of
Druid's key features are:
-1. **Columnar storage format.** Druid uses column-oriented storage, meaning it
only needs to load the exact columns
-needed for a particular query. This gives a huge speed boost to queries that
only hit a few columns. In addition, each
-column is stored optimized for its particular data type, which supports fast
scans and aggregations.
-2. **Scalable distributed system.** Druid is typically deployed in clusters of
tens to hundreds of servers, and can
-offer ingest rates of millions of records/sec, retention of trillions of
records, and query latencies of sub-second to a
-few seconds.
-3. **Massively parallel processing.** Druid can process a query in parallel
across the entire cluster.
-4. **Realtime or batch ingestion.** Druid can ingest data either real-time
(ingested data is immediately available for
-querying) or in batches.
-5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale
the cluster out or in, simply add or
-remove servers and the cluster will rebalance itself automatically, in the
background, without any downtime. If any
-Druid servers fail, the system will automatically route around the damage
until those servers can be replaced. Druid
-is designed to run 24/7 with no need for planned downtimes for any reason,
including configuration changes and software
+1. **Columnar storage format.** Druid uses column-oriented storage. This means
it only loads the exact columns
+needed for a particular query. This greatly improves speed for queries that
retrieve only a few columns. Additionally, to support fast scans and
aggregations, Druid optimizes column storage for each column according to its
data type.
+2. **Scalable distributed system.** Typical Druid deployments span clusters
ranging from tens to hundreds of servers. Druid can ingest data at the rate of
millions of records per second while retaining trillions of records and
maintaining query latencies ranging from the sub-second to a few seconds.
+3. **Massively parallel processing.** Druid can process each query in parallel
across the entire cluster.
+4. **Realtime or batch ingestion.** Druid can ingest data either real-time or
in batches. Ingested data is immediately available for
+querying.
+5. **Self-healing, self-balancing, easy to operate.** As an operator, you add
servers to scale out or
+remove servers to scale down. The Druid cluster re-balances itself
automatically in the background without any downtime. If a
+Druid server fails, the system automatically routes data around the damage
until the server can be replaced. Druid
+is designed to run continuously without planned downtime for any reason. This
is true for configuration changes and software
updates.
-6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once
Druid has ingested your data, a copy is
-stored safely in [deep storage](architecture.md#deep-storage) (typically cloud
storage, HDFS, or a shared filesystem).
-Your data can be recovered from deep storage even if every single Druid server
fails. For more limited failures affecting
-just a few Druid servers, replication ensures that queries are still possible
while the system recovers.
+6. **Cloud-native, fault-tolerant architecture that won't lose data.** After
ingestion, Druid safely stores a copy of your data in [deep
storage](architecture.md#deep-storage). Deep storage is typically cloud
storage, HDFS, or a shared filesystem. You can recover your data from deep
storage even in the unlikely case that all Druid servers fail. For a limited
failure that affects only a few Druid servers, Druid uses replication to ensure
that queries are still possible during system recovers.
Review comment:
I think "recovers" should be "recoveries", although the original wording
("while the system recovers") seems to refer a little more clearly to
"For a limited failure that affects...":
```suggestion
6. **Cloud-native, fault-tolerant architecture that won't lose data.** After
ingestion, Druid safely stores a copy of your data in [deep
storage](architecture.md#deep-storage). Deep storage is typically cloud
storage, HDFS, or a shared filesystem. You can recover your data from deep
storage even in the unlikely case that all Druid servers fail. For a limited
failure that affects only a few Druid servers, replication ensures that queries
are still possible during system recoveries.
```
##########
File path: docs/design/index.md
##########
@@ -22,79 +22,74 @@ title: "Introduction to Apache Druid"
~ under the License.
-->
-## What is Druid?
+Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics
("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries)
on large data sets. Most often, Druid powers use cases where real-time
ingestion, fast query performance, and high uptime are important.
-Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics
-("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries)
on large data sets. Druid is most often
-used as a database for powering use cases where real-time ingest, fast query
performance, and high uptime are important.
-As such, Druid is commonly used for powering GUIs of analytical applications,
or as a backend for highly-concurrent APIs
-that need fast aggregations. Druid works best with event-oriented data.
+Druid is commonly used as the database backend for GUIs of analytical
applications, or for highly-concurrent APIs that need fast aggregations. Druid
works best with event-oriented data.
Common application areas for Druid include:
-- Clickstream analytics (web and mobile analytics)
-- Network telemetry analytics (network performance monitoring)
+- Clickstream analytics including web and mobile analytics
+- Network telemetry analytics including network performance monitoring
- Server metrics storage
-- Supply chain analytics (manufacturing metrics)
+- Supply chain analytics including manufacturing metrics
- Application performance metrics
- Digital marketing/advertising analytics
-- Business intelligence / OLAP
+- Business intelligence/OLAP
+
+## Key features of Druid
Druid's core architecture combines ideas from data warehouses, timeseries
databases, and logsearch systems. Some of
Druid's key features are:
-1. **Columnar storage format.** Druid uses column-oriented storage, meaning it
only needs to load the exact columns
-needed for a particular query. This gives a huge speed boost to queries that
only hit a few columns. In addition, each
-column is stored optimized for its particular data type, which supports fast
scans and aggregations.
-2. **Scalable distributed system.** Druid is typically deployed in clusters of
tens to hundreds of servers, and can
-offer ingest rates of millions of records/sec, retention of trillions of
records, and query latencies of sub-second to a
-few seconds.
-3. **Massively parallel processing.** Druid can process a query in parallel
across the entire cluster.
-4. **Realtime or batch ingestion.** Druid can ingest data either real-time
(ingested data is immediately available for
-querying) or in batches.
-5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale
the cluster out or in, simply add or
-remove servers and the cluster will rebalance itself automatically, in the
background, without any downtime. If any
-Druid servers fail, the system will automatically route around the damage
until those servers can be replaced. Druid
-is designed to run 24/7 with no need for planned downtimes for any reason,
including configuration changes and software
+1. **Columnar storage format.** Druid uses column-oriented storage. This means
it only loads the exact columns
+needed for a particular query. This greatly improves speed for queries that
retrieve only a few columns. Additionally, to support fast scans and
aggregations, Druid optimizes column storage for each column according to its
data type.
+2. **Scalable distributed system.** Typical Druid deployments span clusters
ranging from tens to hundreds of servers. Druid can ingest data at the rate of
millions of records per second while retaining trillions of records and
maintaining query latencies ranging from the sub-second to a few seconds.
+3. **Massively parallel processing.** Druid can process each query in parallel
across the entire cluster.
+4. **Realtime or batch ingestion.** Druid can ingest data either real-time or
in batches. Ingested data is immediately available for
+querying.
+5. **Self-healing, self-balancing, easy to operate.** As an operator, you add
servers to scale out or
+remove servers to scale down. The Druid cluster re-balances itself
automatically in the background without any downtime. If a
+Druid server fails, the system automatically routes data around the damage
until the server can be replaced. Druid
+is designed to run continuously without planned downtime for any reason. This
is true for configuration changes and software
updates.
-6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once
Druid has ingested your data, a copy is
-stored safely in [deep storage](architecture.md#deep-storage) (typically cloud
storage, HDFS, or a shared filesystem).
-Your data can be recovered from deep storage even if every single Druid server
fails. For more limited failures affecting
-just a few Druid servers, replication ensures that queries are still possible
while the system recovers.
+6. **Cloud-native, fault-tolerant architecture that won't lose data.** After
ingestion, Druid safely stores a copy of your data in [deep
storage](architecture.md#deep-storage). Deep storage is typically cloud
storage, HDFS, or a shared filesystem. You can recover your data from deep
storage even in the unlikely case that all Druid servers fail. For a limited
failure that affects only a few Druid servers, Druid uses replication to ensure
that queries are still possible during system recovers.
7. **Indexes for quick filtering.** Druid uses
[Roaring](https://roaringbitmap.org/) or
-[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create
indexes that power fast filtering and
-searching across multiple columns.
-8. **Time-based partitioning.** Druid first partitions data by time, and can
additionally partition based on other fields.
-This means time-based queries will only access the partitions that match the
time range of the query. This leads to
-significant performance improvements for time-based data.
+[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create
indexes to enable fast filtering and searching across multiple columns.
+8. **Time-based partitioning.** Druid first partitions data by time. YOu can
optionally implement additional partitioning based upon other fields.
Review comment:
```suggestion
8. **Time-based partitioning.** Druid first partitions data by time. You can
optionally implement additional partitioning based upon other fields.
```
##########
File path: docs/design/index.md
##########
@@ -22,79 +22,74 @@ title: "Introduction to Apache Druid"
~ under the License.
-->
-## What is Druid?
+Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics
("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries)
on large data sets. Most often, Druid powers use cases where real-time
ingestion, fast query performance, and high uptime are important.
-Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics
-("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries)
on large data sets. Druid is most often
-used as a database for powering use cases where real-time ingest, fast query
performance, and high uptime are important.
-As such, Druid is commonly used for powering GUIs of analytical applications,
or as a backend for highly-concurrent APIs
-that need fast aggregations. Druid works best with event-oriented data.
+Druid is commonly used as the database backend for GUIs of analytical
applications, or for highly-concurrent APIs that need fast aggregations. Druid
works best with event-oriented data.
Common application areas for Druid include:
-- Clickstream analytics (web and mobile analytics)
-- Network telemetry analytics (network performance monitoring)
+- Clickstream analytics including web and mobile analytics
+- Network telemetry analytics including network performance monitoring
- Server metrics storage
-- Supply chain analytics (manufacturing metrics)
+- Supply chain analytics including manufacturing metrics
- Application performance metrics
- Digital marketing/advertising analytics
-- Business intelligence / OLAP
+- Business intelligence/OLAP
+
+## Key features of Druid
Druid's core architecture combines ideas from data warehouses, timeseries
databases, and logsearch systems. Some of
Druid's key features are:
-1. **Columnar storage format.** Druid uses column-oriented storage, meaning it
only needs to load the exact columns
-needed for a particular query. This gives a huge speed boost to queries that
only hit a few columns. In addition, each
-column is stored optimized for its particular data type, which supports fast
scans and aggregations.
-2. **Scalable distributed system.** Druid is typically deployed in clusters of
tens to hundreds of servers, and can
-offer ingest rates of millions of records/sec, retention of trillions of
records, and query latencies of sub-second to a
-few seconds.
-3. **Massively parallel processing.** Druid can process a query in parallel
across the entire cluster.
-4. **Realtime or batch ingestion.** Druid can ingest data either real-time
(ingested data is immediately available for
-querying) or in batches.
-5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale
the cluster out or in, simply add or
-remove servers and the cluster will rebalance itself automatically, in the
background, without any downtime. If any
-Druid servers fail, the system will automatically route around the damage
until those servers can be replaced. Druid
-is designed to run 24/7 with no need for planned downtimes for any reason,
including configuration changes and software
+1. **Columnar storage format.** Druid uses column-oriented storage. This means
it only loads the exact columns
+needed for a particular query. This greatly improves speed for queries that
retrieve only a few columns. Additionally, to support fast scans and
aggregations, Druid optimizes column storage for each column according to its
data type.
+2. **Scalable distributed system.** Typical Druid deployments span clusters
ranging from tens to hundreds of servers. Druid can ingest data at the rate of
millions of records per second while retaining trillions of records and
maintaining query latencies ranging from the sub-second to a few seconds.
+3. **Massively parallel processing.** Druid can process each query in parallel
across the entire cluster.
+4. **Realtime or batch ingestion.** Druid can ingest data either real-time or
in batches. Ingested data is immediately available for
+querying.
+5. **Self-healing, self-balancing, easy to operate.** As an operator, you add
servers to scale out or
+remove servers to scale down. The Druid cluster re-balances itself
automatically in the background without any downtime. If a
+Druid server fails, the system automatically routes data around the damage
until the server can be replaced. Druid
+is designed to run continuously without planned downtime for any reason. This
is true for configuration changes and software
updates.
-6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once
Druid has ingested your data, a copy is
-stored safely in [deep storage](architecture.md#deep-storage) (typically cloud
storage, HDFS, or a shared filesystem).
-Your data can be recovered from deep storage even if every single Druid server
fails. For more limited failures affecting
-just a few Druid servers, replication ensures that queries are still possible
while the system recovers.
+6. **Cloud-native, fault-tolerant architecture that won't lose data.** After
ingestion, Druid safely stores a copy of your data in [deep
storage](architecture.md#deep-storage). Deep storage is typically cloud
storage, HDFS, or a shared filesystem. You can recover your data from deep
storage even in the unlikely case that all Druid servers fail. For a limited
failure that affects only a few Druid servers, Druid uses replication to ensure
that queries are still possible during system recovers.
7. **Indexes for quick filtering.** Druid uses
[Roaring](https://roaringbitmap.org/) or
-[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create
indexes that power fast filtering and
-searching across multiple columns.
-8. **Time-based partitioning.** Druid first partitions data by time, and can
additionally partition based on other fields.
-This means time-based queries will only access the partitions that match the
time range of the query. This leads to
-significant performance improvements for time-based data.
+[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create
indexes to enable fast filtering and searching across multiple columns.
+8. **Time-based partitioning.** Druid first partitions data by time. YOu can
optionally implement additional partitioning based upon other fields.
+Time-based queries only access the partitions that match the time range of the
query which leads to significant performance improvements.
9. **Approximate algorithms.** Druid includes algorithms for approximate
count-distinct, approximate ranking, and
computation of approximate histograms and quantiles. These algorithms offer
bounded memory usage and are often
substantially faster than exact computations. For situations where accuracy is
more important than speed, Druid also
offers exact count-distinct and exact ranking.
10. **Automatic summarization at ingest time.** Druid optionally supports data
summarization at ingestion time. This
-summarization partially pre-aggregates your data, and can lead to big costs
savings and performance boosts.
+summarization partially pre-aggregates your data potentially leading to
significant cost savings and performance boosts.
Review comment:
The rewording is good, but I think the sentence runs together without a comma:
```suggestion
summarization partially pre-aggregates your data, potentially leading to
significant cost savings and performance boosts.
```
##########
File path: docs/design/index.md
##########
@@ -22,79 +22,74 @@ title: "Introduction to Apache Druid"
~ under the License.
-->
-## What is Druid?
+Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics
("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries)
on large data sets. Most often, Druid powers use cases where real-time
ingestion, fast query performance, and high uptime are important.
-Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics
-("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries)
on large data sets. Druid is most often
-used as a database for powering use cases where real-time ingest, fast query
performance, and high uptime are important.
-As such, Druid is commonly used for powering GUIs of analytical applications,
or as a backend for highly-concurrent APIs
-that need fast aggregations. Druid works best with event-oriented data.
+Druid is commonly used as the database backend for GUIs of analytical
applications, or for highly-concurrent APIs that need fast aggregations. Druid
works best with event-oriented data.
Common application areas for Druid include:
-- Clickstream analytics (web and mobile analytics)
-- Network telemetry analytics (network performance monitoring)
+- Clickstream analytics including web and mobile analytics
+- Network telemetry analytics including network performance monitoring
- Server metrics storage
-- Supply chain analytics (manufacturing metrics)
+- Supply chain analytics including manufacturing metrics
- Application performance metrics
- Digital marketing/advertising analytics
-- Business intelligence / OLAP
+- Business intelligence/OLAP
+
+## Key features of Druid
Druid's core architecture combines ideas from data warehouses, timeseries
databases, and logsearch systems. Some of
Druid's key features are:
-1. **Columnar storage format.** Druid uses column-oriented storage, meaning it
only needs to load the exact columns
-needed for a particular query. This gives a huge speed boost to queries that
only hit a few columns. In addition, each
-column is stored optimized for its particular data type, which supports fast
scans and aggregations.
-2. **Scalable distributed system.** Druid is typically deployed in clusters of
tens to hundreds of servers, and can
-offer ingest rates of millions of records/sec, retention of trillions of
records, and query latencies of sub-second to a
-few seconds.
-3. **Massively parallel processing.** Druid can process a query in parallel
across the entire cluster.
-4. **Realtime or batch ingestion.** Druid can ingest data either real-time
(ingested data is immediately available for
-querying) or in batches.
-5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale
the cluster out or in, simply add or
-remove servers and the cluster will rebalance itself automatically, in the
background, without any downtime. If any
-Druid servers fail, the system will automatically route around the damage
until those servers can be replaced. Druid
-is designed to run 24/7 with no need for planned downtimes for any reason,
including configuration changes and software
+1. **Columnar storage format.** Druid uses column-oriented storage. This means
it only loads the exact columns
+needed for a particular query. This greatly improves speed for queries that
retrieve only a few columns. Additionally, to support fast scans and
aggregations, Druid optimizes column storage for each column according to its
data type.
+2. **Scalable distributed system.** Typical Druid deployments span clusters
ranging from tens to hundreds of servers. Druid can ingest data at the rate of
millions of records per second while retaining trillions of records and
maintaining query latencies ranging from the sub-second to a few seconds.
+3. **Massively parallel processing.** Druid can process each query in parallel
across the entire cluster.
+4. **Realtime or batch ingestion.** Druid can ingest data either real-time or
in batches. Ingested data is immediately available for
+querying.
+5. **Self-healing, self-balancing, easy to operate.** As an operator, you add
servers to scale out or
+remove servers to scale down. The Druid cluster re-balances itself
automatically in the background without any downtime. If a
+Druid server fails, the system automatically routes data around the damage
until the server can be replaced. Druid
+is designed to run continuously without planned downtime for any reason. This
is true for configuration changes and software
updates.
-6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once
Druid has ingested your data, a copy is
-stored safely in [deep storage](architecture.md#deep-storage) (typically cloud
storage, HDFS, or a shared filesystem).
-Your data can be recovered from deep storage even if every single Druid server
fails. For more limited failures affecting
-just a few Druid servers, replication ensures that queries are still possible
while the system recovers.
+6. **Cloud-native, fault-tolerant architecture that won't lose data.** After
ingestion, Druid safely stores a copy of your data in [deep
storage](architecture.md#deep-storage). Deep storage is typically cloud
storage, HDFS, or a shared filesystem. You can recover your data from deep
storage even in the unlikely case that all Druid servers fail. For a limited
failure that affects only a few Druid servers, Druid uses replication to ensure
that queries are still possible during system recovers.
7. **Indexes for quick filtering.** Druid uses
[Roaring](https://roaringbitmap.org/) or
-[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create
indexes that power fast filtering and
-searching across multiple columns.
-8. **Time-based partitioning.** Druid first partitions data by time, and can
additionally partition based on other fields.
-This means time-based queries will only access the partitions that match the
time range of the query. This leads to
-significant performance improvements for time-based data.
+[CONCISE](https://arxiv.org/pdf/1004.0403) compressed bitmap indexes to create
indexes to enable fast filtering and searching across multiple columns.
+8. **Time-based partitioning.** Druid first partitions data by time. YOu can
optionally implement additional partitioning based upon other fields.
+Time-based queries only access the partitions that match the time range of the
query which leads to significant performance improvements.
9. **Approximate algorithms.** Druid includes algorithms for approximate
count-distinct, approximate ranking, and
computation of approximate histograms and quantiles. These algorithms offer
bounded memory usage and are often
substantially faster than exact computations. For situations where accuracy is
more important than speed, Druid also
offers exact count-distinct and exact ranking.
10. **Automatic summarization at ingest time.** Druid optionally supports data
summarization at ingestion time. This
-summarization partially pre-aggregates your data, and can lead to big costs
savings and performance boosts.
+summarization partially pre-aggregates your data potentially leading to
significant cost savings and performance boosts.
-## When should I use Druid?
+## When to use Druid
-Druid is used by many companies of various sizes for many different use cases.
Check out the
-[Powered by Apache Druid](/druid-powered) page
+Druid is used by many companies of various sizes for many different use cases.
For more information see
+[Powered by Apache Druid](/druid-powered).
-Druid is likely a good choice if your use case fits a few of the following
descriptors:
+Druid is likely a good choice if your use case has matches a few of the
following:
Review comment:
```suggestion
Druid is likely a good choice if your use case matches a few of the
following:
```