techdocsmith commented on a change in pull request #11584:
URL: https://github.com/apache/druid/pull/11584#discussion_r720581464
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
Review comment:
```suggestion
The [Coordinator](../design/coordinator.html) controls the assignment of
segments to Historicals and the balance of segments between Historicals.
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral entries in Zookeeper in a [load queue
path](../configuration/index.html#path-configuration). Each Historical process
maintains a connection to Zookeeper, watching those paths for segment
information.
```
What does this knowledge of the internals of segment assignment gain me?
What am I supposed to do with this knowledge?
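If we do keep the link to the path configuration, a concrete illustration of the Zookeeper paths involved might give readers something actionable. This is only a sketch with assumed defaults; the property names and values should be checked against the configuration reference:
```properties
# Base Zookeeper path for the cluster (assumed default)
druid.zk.paths.base=/druid
# Path where the Coordinator creates load queue entries for each Historical (assumed default)
druid.zk.paths.loadQueuePath=/druid/loadQueue
# Path where each Historical announces the segments it serves (assumed default)
druid.zk.paths.servedSegmentsPath=/druid/servedSegments
```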
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
Review comment:
```suggestion
The primary form of caching in Druid is a *per-segment results cache*. This
cache stores partial query results on a per-segment basis and is enabled on
Historical services by default.
```
nit: avoid "this" as a pronoun
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
+
+Once a Historical process has completed pulling down and processing a segment
from Deep Storage, the segment is advertised as being available for queries.
This announcement by the Historical is made via Zookeeper, this time under a
[served segments path](../configuration/index.html#path-configuration). At this
point, the segment is considered available for querying by the Broker.
Review comment:
```suggestion
After a Historical process pulls down and processes a segment from Deep
Storage, Druid advertises the segment as being available for queries from the
Broker. This announcement by the Historical is made via Zookeeper, in a
[served segments path](../configuration/index.html#path-configuration).
```
Why do I need to know that the announcement is going through Zookeeper?
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
+
+Once a Historical process has completed pulling down and processing a segment
from Deep Storage, the segment is advertised as being available for queries.
This announcement by the Historical is made via Zookeeper, this time under a
[served segments path](../configuration/index.html#path-configuration). At this
point, the segment is considered available for querying by the Broker.
+
+For more information about how the Broker determines what data is available
for queries, please see [Broker](broker.html).
+
+On startup, a Historical process searches through its segment cache and, in
order for Historicals to be queried as soon as possible, immediately advertises
all segments it finds there.
Review comment:
```suggestion
To make data from the segment cache available for querying as soon as
possible, Historical services search the local segment cache upon startup and
advertise the segments found there.
```
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
Review comment:
```suggestion
Druid can store cache data on the local JVM heap or in an external
distributed key/value store, for example memcached. The default is a local
cache based upon [Caffeine](https://github.com/ben-manes/caffeine). The default
maximum cache storage size is the lesser of 1 GiB or ten percent of the maximum
runtime memory for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
The caffeine-based cache resides within the JVM heap and is directly
measurable. Heap usage grows up to the maximum configured size. After that
Druid flushes the least recently used segment results and replaces them with
newer results.
```
I think of "flushing" from a cache rather than "evicting". Maybe I'm confused.
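Separately, wherever this paragraph ends up, a short config sketch might make the local-cache defaults easier to picture. The values below are illustrative only, not recommendations:
```properties
# Local Caffeine-based query cache (the default cache type)
druid.cache.type=caffeine
# Illustrative 1 GiB cap; by default Druid uses the lesser of 1 GiB
# or ten percent of the maximum JVM runtime memory
druid.cache.sizeInBytes=1073741824
```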
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
Review comment:
```suggestion
- [Whole-query caching](#whole-query-caching) stores final query results.
```
avoid italics for emphasis. Not sure we need emphasis here.
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
Review comment:
```suggestion
Druid invalidates any cache the moment any underlying data changes to avoid
returning stale results. This is especially important for `table` datasources
that have highly-variable underlying data segments, including real-time data
segments.
```
This is not an admonition. It's documentation of how the feature works. If
we need to influence customer behavior, it may be worth reiterating it
elsewhere.
##########
File path: docs/operations/basic-cluster-tuning.md
##########
@@ -90,11 +90,9 @@ Tuning the cluster so that each Historical can accept 50
queries and 10 non-quer
#### Segment Cache Size
-`druid.segmentCache.locations` specifies locations where segment data can be
stored on the Historical. The sum of available disk space across these
locations is set as the default value for property: `druid.server.maxSize`,
which controls the total size of segment data that can be assigned by the
Coordinator to a Historical.
+Avoid allocating a Historical an excessive amount of segment data. As the
value of (`free system memory` / total size of all
`druid.segmentCache.locations`) increases, a greater proportion of segments can
be kept in the [memory-mapped segment
cache](../design/historical.md#loading-and-serving-segments-from-cache),
allowing for better query performance.
Review comment:
```suggestion
For better query performance, do not allocate segment data to a Historical
in excess of the system free memory. The greater the `free system memory`
relative to the total size of `druid.segmentCache.locations`, the more segment
data the Historical can hold in the memory-mapped segment cache.
```
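An illustrative snippet could also help readers connect the two properties. The path and sizes below are made up:
```properties
# Local disk location(s) for cached segment files, each with a size cap in bytes
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":300000000000}]
# Total segment data the Coordinator can assign to this Historical;
# defaults to the sum of maxSize across the cache locations
druid.server.maxSize=300000000000
```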
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
Review comment:
```suggestion
The per-segment results cache allows Druid to maintain a low-eviction-rate
cache for segments that do not change, especially important for those segments
that [historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html).
Real-time segments, on the other hand, continue to have results computed at
query time.
```
Why does it matter that the Coordinator instructs them to do this? How do I
use that information in a practical sense?
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
Review comment:
Why do we need this here? This should be covered in the how-to section.
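If a brief pointer does stay here, a minimal sketch of the Historical-side properties might be enough. Values are illustrative; the `useCache` and `populateCache` query context parameters control the same behavior per query:
```properties
# Historical-side per-segment cache behavior (enabled by default)
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
```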
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
+
+Once a Historical process has completed pulling down and processing a segment
from Deep Storage, the segment is advertised as being available for queries.
This announcement by the Historical is made via Zookeeper, this time under a
[served segments path](../configuration/index.html#path-configuration). At this
point, the segment is considered available for querying by the Broker.
+
+For more information about how the Broker determines what data is available
for queries, please see [Broker](broker.html).
+
+On startup, a Historical process searches through its segment cache and, in
order for Historicals to be queried as soon as possible, immediately advertises
all segments it finds there.
### Loading and serving segments from cache
-Recall that when a Historical process notices a new segment entry in its load
queue path, the Historical process first checks a configurable cache directory
on its local disk to see if the segment had been previously downloaded. If a
local cache entry already exists, the Historical process will directly read the
segment binary files from disk and load the segment.
+A technique called [memory mapping](https://en.wikipedia.org/wiki/Mmap) is
used for the segment cache, consuming memory from the underlying operating
system so that parts of segment files can be held in memory, increasing query
performance at the data level. The in-memory segment cache is therefore
affected by, for example, the size of the Historical JVM, heap / direct memory
buffers, and other processes on the operating system itself.
+
+At query time, if the required part of a segment file is available in the
memory mapped cache (also known as the "page cache"), it will be re-used and
read directly from memory. If it is not, that part of the segment will be read
from disk. When this happens, there is potential for this new data to evict
other segment data from memory. Consequently, the closer that free operating
system memory is to `druid.server.maxSize`, the faster historical processes
typically respond at query time since segment data is very likely to be
available in memory. Conversely, the lower the free operating system memory,
the more likely a Historical is to read segments from disk.
Review comment:
```suggestion
At query time, if the required part of a segment file is available in the
memory mapped cache or "page cache", the Historical re-uses it and reads it
directly from memory. If it is not in the memory-mapped cache, the Historical
reads that part of the segment from disk. In this case, there is potential for
new data to flush other segment data from memory. This means that the closer
free operating system memory is to `druid.server.maxSize`, the more likely
segment data is to be available in memory, which reduces query times.
Conversely, the lower the free operating system memory, the more likely a
Historical is to read segments from disk.
```
This seems like the most important point to me, like it should be part of some
sort of operational best practice. For example, when I'm setting up my system,
I want to know that I should keep as much of the segment data in memory as
possible to make queries fast.
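If this does turn into an operational best practice, a rough sizing sketch could anchor it. The numbers below are purely illustrative assumptions:
```properties
# Example: a host with 256 GiB of RAM where the Historical JVM uses about
# 60 GiB (heap plus direct buffers), leaving roughly 190 GiB of free memory
# for the page cache. Keeping druid.server.maxSize near that free memory
# means most of the assigned segment data can stay memory-mapped.
druid.server.maxSize=190000000000
```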
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
+
+Once a Historical process has completed pulling down and processing a segment
from Deep Storage, the segment is advertised as being available for queries.
This announcement by the Historical is made via Zookeeper, this time under a
[served segments path](../configuration/index.html#path-configuration). At this
point, the segment is considered available for querying by the Broker.
+
+For more information about how the Broker determines what data is available
for queries, please see [Broker](broker.html).
+
+On startup, a Historical process searches through its segment cache and, in
order for Historicals to be queried as soon as possible, immediately advertises
all segments it finds there.
### Loading and serving segments from cache
-Recall that when a Historical process notices a new segment entry in its load
queue path, the Historical process first checks a configurable cache directory
on its local disk to see if the segment had been previously downloaded. If a
local cache entry already exists, the Historical process will directly read the
segment binary files from disk and load the segment.
+A technique called [memory mapping](https://en.wikipedia.org/wiki/Mmap) is
used for the segment cache, consuming memory from the underlying operating
system so that parts of segment files can be held in memory, increasing query
performance at the data level. The in-memory segment cache is therefore
affected by, for example, the size of the Historical JVM, heap / direct memory
buffers, and other processes on the operating system itself.
Review comment:
```suggestion
The segment cache uses [memory mapping](https://en.wikipedia.org/wiki/Mmap).
The cache consumes memory from the underlying operating system so Historicals
can hold parts of segment files in memory to increase query performance at the
data level. The in-memory segment cache is affected by the size of the
Historical JVM, heap / direct memory buffers, and other processes on the
operating system itself.
```
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
Review comment:
```suggestion
See the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size) for more
information.
```
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
+
+> **Use per-segment caching with real-time data**
+>
+> For example, you could have queries that address fresh data arriving from
Kafka, and which additionally covers intervals in segments that are loaded on
historicals. Per-segment caching allows Druid to cache results from the
historical segments confidently, and to merge them with real-time results from
the stream. [Whole-query caching](#whole-query-caching), on the other hand,
would not be helpful in this scenario because new data from real-time ingestion
will continually invalidate the _entire_ result.
### Whole-query caching
-If real-time ingestion invalidating the cache is not an issue for your
queries, you can use **whole-query caching** on the Broker to increase query
efficiency. The Broker performs whole-query caching operations before sending
fan out queries to Historicals. Therefore Druid no longer needs to merge the
per-segment results on the Broker.
+Here, entire results of individual queries are cached, meaning Druid no longer
needs to merge the per-segment results on the Broker.
+
+Whole-query result caching is controlled by the parameters
`useResultLevelCache` and `populateResultLevelCache` and runtime properties
`druid.broker.cache.*`.
Review comment:
This should be covered in the how-to, "Using query caching".
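+1 on moving the parameter details into the how-to. For reference while reviewing, a minimal Broker-side sketch with illustrative values:
```properties
# Broker-side whole-query (result-level) cache
druid.broker.cache.useResultLevelCache=true
druid.broker.cache.populateResultLevelCache=true
```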
##########
File path: docs/operations/basic-cluster-tuning.md
##########
@@ -90,11 +90,9 @@ Tuning the cluster so that each Historical can accept 50
queries and 10 non-quer
#### Segment Cache Size
-`druid.segmentCache.locations` specifies locations where segment data can be
stored on the Historical. The sum of available disk space across these
locations is set as the default value for property: `druid.server.maxSize`,
which controls the total size of segment data that can be assigned by the
Coordinator to a Historical.
+Avoid allocating a Historical an excessive amount of segment data. As the
value of (`free system memory` / total size of all
`druid.segmentCache.locations`) increases, a greater proportion of segments can
be kept in the [memory-mapped segment
cache](../design/historical.md#loading-and-serving-segments-from-cache),
allowing for better query performance.
-Segments are memory-mapped by Historical processes using any available free
system memory (i.e., memory not used by the Historical JVM and heap/direct
memory buffers or other processes on the system). Segments that are not
currently in memory will be paged from disk when queried.
-
-Therefore, the size of cache locations set within
`druid.segmentCache.locations` should be such that a Historical is not
allocated an excessive amount of segment data. As the value of (`free system
memory` / total size of all `druid.segmentCache.locations`) increases, a
greater proportion of segments can be kept in memory, allowing for better query
performance. The total segment data size assigned to a Historical can be
overridden with `druid.server.maxSize`, but this is not required for most of
the use cases.
+The total segment data size assigned to a Historical is calculated using
`druid.segmentCache.locations` but can be overridden with
`druid.server.maxSize`. This is not required for most of the use cases.
Review comment:
```suggestion
Druid uses `druid.segmentCache.locations` to calculate the total segment
data size assigned to a Historical. For some rarer use cases, you can override
this behavior with the `druid.server.maxSize` property.
```
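A small sketch of the override case could help too, assuming made-up sizes:
```properties
# Two cache locations holding up to 500 GB of segment files in total...
druid.segmentCache.locations=[{"path":"/mnt/disk1/segment-cache","maxSize":250000000000},{"path":"/mnt/disk2/segment-cache","maxSize":250000000000}]
# ...while capping what the Coordinator assigns to this Historical at 400 GB
druid.server.maxSize=400000000000
```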
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
Review comment:
```suggestion
Druid may merge per-segment cached results with the results of later queries
that use a similar basic shape with similar filters, aggregations, and so on.
For example, the query may be identical except that it covers a different time
period.
```
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
+
+> **Use per-segment caching with real-time data**
+>
+> For example, you could have queries that address fresh data arriving from
Kafka, and which additionally covers intervals in segments that are loaded on
historicals. Per-segment caching allows Druid to cache results from the
historical segments confidently, and to merge them with real-time results from
the stream. [Whole-query caching](#whole-query-caching), on the other hand,
would not be helpful in this scenario because new data from real-time ingestion
will continually invalidate the _entire_ result.
Review comment:
```suggestion
Use per-segment caching with real-time data. For example, your queries
request data actively arriving from Kafka alongside intervals in segments that
are loaded on Historicals. Druid can merge cached results from Historical
segments with real-time results from the stream. [Whole-query
caching](#whole-query-caching), on the other hand, is not helpful in this
scenario because new data from real-time ingestion will continually invalidate
the entire cached result.
```
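To make this concrete, a minimal sketch of the Historical-side properties the
scenario relies on; the property names are those documented under Historical
caching in the configuration reference, and per-query overrides go through the
`useCache` and `populateCache` query context parameters:

```properties
# Historical runtime.properties: cache per-segment results for immutable segments
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
```

Real-time segments are still computed at query time; only results from the
segments loaded on Historicals are served from this cache.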
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
Review comment:
```suggestion
For more information about segment metadata and Druid segments in general,
see [Segments](../design/segments.html).
```
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
Review comment:
```suggestion
- [Per-segment caching](#per-segment-caching) stores partial query results
for a specific segment. It is enabled by default.
```
avoid parens
##########
File path: docs/querying/caching.md
##########
@@ -69,9 +89,11 @@ For instance, whole-query caching is a good option when you
have queries that in
- On Brokers for small production clusters with less than five servers.
- Do not use per-segment caches on the Broker for large production
clusters. When `druid.broker.cache.populateCache` is `true` and query context
parameter `populateCache` _is not_ `false`, Historicals return results on a
per-segment basis without merging results locally thus negatively impacting
cluster scalability.
+> **Avoid using per-segment cache at the Broker for large production clusters**
+>
+> When the Broker cache is enabled (`druid.broker.cache.populateCache` is
`true`) and `populateCache` _is not_ `false` in the [query
context](../querying/query-context.html), individual Historicals will _not_
merge individual segment-level results, and instead pass these back to the lead
Broker. The Broker must then carry out a large merge from _all_ segments on
its own.
Review comment:
```suggestion
Avoid using per-segment cache at the Broker for large production clusters.
When the Broker cache is enabled (`druid.broker.cache.populateCache` is `true`)
and `populateCache` _is not_ `false` in the [query
context](../querying/query-context.html), individual Historicals will _not_
merge individual segment-level results, and instead pass these back to the lead
Broker. The Broker must then carry out a large merge from _all_ segments on
its own.
```
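For readers, a hedged sketch of what that advice looks like in runtime
properties, keeping segment-level merging on the Historicals; property names
are those from the Broker and Historical caching sections of the configuration
reference:

```properties
# Broker runtime.properties: leave the per-segment cache off for large clusters
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false

# Historical runtime.properties: cache and merge per-segment results locally
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
```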
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
+
+> **Use per-segment caching with real-time data**
+>
+> For example, you could have queries that address fresh data arriving from
Kafka, and which additionally covers intervals in segments that are loaded on
historicals. Per-segment caching allows Druid to cache results from the
historical segments confidently, and to merge them with real-time results from
the stream. [Whole-query caching](#whole-query-caching), on the other hand,
would not be helpful in this scenario because new data from real-time ingestion
will continually invalidate the _entire_ result.
### Whole-query caching
-If real-time ingestion invalidating the cache is not an issue for your
queries, you can use **whole-query caching** on the Broker to increase query
efficiency. The Broker performs whole-query caching operations before sending
fan out queries to Historicals. Therefore Druid no longer needs to merge the
per-segment results on the Broker.
+Here, entire results of individual queries are cached, meaning Druid no longer
needs to merge the per-segment results on the Broker.
Review comment:
```suggestion
With *whole-query caching*, Druid caches the entire results of individual
queries. In this case the Broker no longer needs to merge the per-segment
results.
```
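It might also help to show where this is switched on. A minimal sketch of the
Broker-side properties, assuming the names documented under Broker caching in
the configuration reference; per-query overrides go through the
`useResultLevelCache` and `populateResultLevelCache` query context parameters:

```properties
# Broker runtime.properties: cache and reuse whole-query (result-level) results
druid.broker.cache.useResultLevelCache=true
druid.broker.cache.populateResultLevelCache=true
```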
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
Review comment:
```suggestion
Each Historical process copies or "pulls" segment files from Deep Storage to
local disk in an area called the *segment cache*. Set the
`druid.segmentCache.locations` property to configure the size and location of the
segment cache on each Historical process. See [Historical general
configuration](../configuration/index.html#historical-general-configuration).
```
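A minimal sketch of the shape of that setting may help too; the path and size
below are purely illustrative, and the value is a JSON array with one entry
per local cache location:

```properties
# Historical runtime.properties: one 300 GB segment cache location (example values)
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]
```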
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
+
+> **Use per-segment caching with real-time data**
+>
+> For example, you could have queries that address fresh data arriving from
Kafka, and which additionally covers intervals in segments that are loaded on
historicals. Per-segment caching allows Druid to cache results from the
historical segments confidently, and to merge them with real-time results from
the stream. [Whole-query caching](#whole-query-caching), on the other hand,
would not be helpful in this scenario because new data from real-time ingestion
will continually invalidate the _entire_ result.
### Whole-query caching
-If real-time ingestion invalidating the cache is not an issue for your
queries, you can use **whole-query caching** on the Broker to increase query
efficiency. The Broker performs whole-query caching operations before sending
fan out queries to Historicals. Therefore Druid no longer needs to merge the
per-segment results on the Broker.
+Here, entire results of individual queries are cached, meaning Druid no longer
needs to merge the per-segment results on the Broker.
+
+Whole-query result caching is controlled by the parameters
`useResultLevelCache` and `populateResultLevelCache` and runtime properties
`druid.broker.cache.*`.
-For instance, whole-query caching is a good option when you have queries that
include data from a batch ingestion task that runs every few hours or once a
day. Per-segment caching would be less efficient in this case because it
requires Druid to merge the per-segment results for each query, even when the
results are cached.
+> **Use whole-query caching when segments are stable and you are not using
real-time ingestion**
+>
+> If real-time ingestion invalidating the cache is not an issue for your
queries, you can use *whole-query caching* on the Broker to increase query
efficiency. The Broker performs whole-query caching operations before sending
fan out queries to Historicals.
+>
+> For instance, whole-query caching is a good option when you have queries
that include data from a batch ingestion task that runs every few hours or once
a day. [Per-segment caching](#per-segment-caching) would be less efficient in
this case because it requires Druid to merge the per-segment results for each
query, even when the results are cached.
Review comment:
```suggestion
Use whole-query caching when segments are stable and you are not using
real-time ingestion. If real-time ingestion invalidating the cache is not an
issue for your queries, you can use *whole-query caching* on the Broker to
increase query efficiency.

For instance, whole-query caching is a good option when you have queries
that include data from a batch ingestion task that runs every few hours or once
a day. [Per-segment caching](#per-segment-caching) would be less efficient in
this case because it requires Druid to merge the per-segment results for each
query, even when the results are cached.
```
Why do I care that the Broker performs the caching before the fan out? How
does that modify my behavior?
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
Review comment:
```suggestion
When a Historical process detects a new entry in the Zookeeper load queue,
it checks its own segment cache. If no information about the segment exists
there, the Historical process first retrieves metadata from Zookeeper about the
segment, including where the segment is located in Deep Storage and how it
needs to decompress and process it.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]