techdocsmith commented on a change in pull request #11584:
URL: https://github.com/apache/druid/pull/11584#discussion_r720581464
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
Review comment:
```suggestion
The [Coordinator](../design/coordinator.html) controls the assignment of
segments to Historicals and the balance of segments between Historicals.
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral entries in Zookeeper in a [load queue
path](../configuration/index.html#path-configuration). Each Historical process
maintains a connection to Zookeeper, watching those paths for segment
information.
```
What does this knowledge of the internals of segment assignment gain me?
What am I supposed to do with this knowledge?
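If we do keep the link to the path configuration, a concrete illustration of the Zookeeper paths involved might give readers something actionable. This is only a sketch with assumed defaults; the property names and values should be checked against the configuration reference:
```properties
# Base Zookeeper path for the cluster (assumed default)
druid.zk.paths.base=/druid
# Path where the Coordinator creates load queue entries for each Historical (assumed default)
druid.zk.paths.loadQueuePath=/druid/loadQueue
# Path where each Historical announces the segments it serves (assumed default)
druid.zk.paths.servedSegmentsPath=/druid/servedSegments
```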
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
Review comment:
```suggestion
The primary form of caching in Druid is a *per-segment results cache*. This
cache stores partial query results on a per-segment basis and is enabled on
Historical services by default.
```
nit: avoid "this" as a pronoun
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
+
+Once a Historical process has completed pulling down and processing a segment
from Deep Storage, the segment is advertised as being available for queries.
This announcement by the Historical is made via Zookeeper, this time under a
[served segments path](../configuration/index.html#path-configuration). At this
point, the segment is considered available for querying by the Broker.
Review comment:
```suggestion
After a Historical process pulls down and processes a segment from Deep
Storage, Druid advertises the segment as being available for queries from the
Broker. This announcement by the Historical is made via Zookeeper, in a
[served segments path](../configuration/index.html#path-configuration).
```
Why do I need to know that the announcement is going through Zookeeper?
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
+
+Once a Historical process has completed pulling down and processing a segment
from Deep Storage, the segment is advertised as being available for queries.
This announcement by the Historical is made via Zookeeper, this time under a
[served segments path](../configuration/index.html#path-configuration). At this
point, the segment is considered available for querying by the Broker.
+
+For more information about how the Broker determines what data is available
for queries, please see [Broker](broker.html).
+
+On startup, a Historical process searches through its segment cache and, in
order for Historicals to be queried as soon as possible, immediately advertises
all segments it finds there.
Review comment:
```suggestion
To make data from the segment cache available for querying as soon as
possible, Historical services search the local segment cache upon startup and
advertise the segments found there.
```
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
Review comment:
```suggestion
Druid can store cache data on the local JVM heap or in an external
distributed key/value store, for example memcached. The default is a local
cache based upon [Caffeine](https://github.com/ben-manes/caffeine). The default
maximum cache storage size is the lesser of 1 GiB or ten percent of the maximum
runtime memory for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
The caffeine-based cache resides within the JVM heap and is directly
measurable. Heap usage grows up to the maximum configured size. After that
Druid flushes the least recently used segment results and replaces them with
newer results.
```
I think of "flushing" from a cache rather than "evicting". Maybe I'm confused.
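Separately, wherever this paragraph ends up, a short config sketch might make the local-cache defaults easier to picture. The values below are illustrative only, not recommendations:
```properties
# Local Caffeine-based query cache (the default cache type)
druid.cache.type=caffeine
# Illustrative 1 GiB cap; by default Druid uses the lesser of 1 GiB
# or ten percent of the maximum JVM runtime memory
druid.cache.sizeInBytes=1073741824
```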
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
Review comment:
```suggestion
- [Whole-query caching](#whole-query-caching) stores final query results.
```
avoid italics for emphasis. Not sure we need emphasis here.
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
Review comment:
```suggestion
Druid invalidates any cache the moment any underlying data changes to avoid
returning stale results. This is especially important for `table` datasources
that have highly-variable underlying data segments, including real-time data
segments.
```
This is not an admonition. It's documentation of how the feature works. If
we need to influence customer behavior, it may be worth reiterating it
elsewhere.
##########
File path: docs/operations/basic-cluster-tuning.md
##########
@@ -90,11 +90,9 @@ Tuning the cluster so that each Historical can accept 50
queries and 10 non-quer
#### Segment Cache Size
-`druid.segmentCache.locations` specifies locations where segment data can be
stored on the Historical. The sum of available disk space across these
locations is set as the default value for property: `druid.server.maxSize`,
which controls the total size of segment data that can be assigned by the
Coordinator to a Historical.
+Avoid allocating a Historical an excessive amount of segment data. As the
value of (`free system memory` / total size of all
`druid.segmentCache.locations`) increases, a greater proportion of segments can
be kept in the [memory-mapped segment
cache](../design/historical.md#loading-and-serving-segments-from-cache),
allowing for better query performance.
Review comment:
```suggestion
For better query performance, do not allocate segment data to a Historical
in excess of the system free memory. The greater the `free system memory`
relative to the total size of `druid.segmentCache.locations`, the more segment
data the Historical can hold in the memory-mapped segment cache.
```
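An illustrative snippet could also help readers connect the two properties. The path and sizes below are made up:
```properties
# Local disk location(s) for cached segment files, each with a size cap in bytes
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":300000000000}]
# Total segment data the Coordinator can assign to this Historical;
# defaults to the sum of maxSize across the cache locations
druid.server.maxSize=300000000000
```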
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
Review comment:
```suggestion
The per-segment results cache allows Druid to maintain a low-eviction-rate
cache for segments that do not change, especially important for those segments
that [historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html).
Real-time segments, on the other hand, continue to have results computed at
query time.
```
Why does it matter that the Coordinator instructs them to do this? How do I
use that information in a practical sense?
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
Review comment:
Why do we need this here? This should be covered in the how-to section.
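If a brief pointer does stay here, a minimal sketch of the Historical-side properties might be enough. Values are illustrative; the `useCache` and `populateCache` query context parameters control the same behavior per query:
```properties
# Historical-side per-segment cache behavior (enabled by default)
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
```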
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
+
+Once a Historical process has completed pulling down and processing a segment
from Deep Storage, the segment is advertised as being available for queries.
This announcement by the Historical is made via Zookeeper, this time under a
[served segments path](../configuration/index.html#path-configuration). At this
point, the segment is considered available for querying by the Broker.
+
+For more information about how the Broker determines what data is available
for queries, please see [Broker](broker.html).
+
+On startup, a Historical process searches through its segment cache and, in
order for Historicals to be queried as soon as possible, immediately advertises
all segments it finds there.
### Loading and serving segments from cache
-Recall that when a Historical process notices a new segment entry in its load
queue path, the Historical process first checks a configurable cache directory
on its local disk to see if the segment had been previously downloaded. If a
local cache entry already exists, the Historical process will directly read the
segment binary files from disk and load the segment.
+A technique called [memory mapping](https://en.wikipedia.org/wiki/Mmap) is
used for the segment cache, consuming memory from the underlying operating
system so that parts of segment files can be held in memory, increasing query
performance at the data level. The in-memory segment cache is therefore
affected by, for example, the size of the Historical JVM, heap / direct memory
buffers, and other processes on the operating system itself.
+
+At query time, if the required part of a segment file is available in the
memory mapped cache (also known as the "page cache"), it will be re-used and
read directly from memory. If it is not, that part of the segment will be read
from disk. When this happens, there is potential for this new data to evict
other segment data from memory. Consequently, the closer that free operating
system memory is to `druid.server.maxSize`, the faster historical processes
typically respond at query time since segment data is very likely to be
available in memory. Conversely, the lower the free operating system memory,
the more likely a Historical is to read segments from disk.
Review comment:
```suggestion
At query time, if the required part of a segment file is available in the
memory mapped cache or "page cache", the Historical re-uses it and reads it
directly from memory. If it is not in the memory-mapped cache, the Historical
reads that part of the segment from disk. In this case, there is potential for
new data to flush other segment data from memory. This means that the closer
free operating system memory is to `druid.server.maxSize`, the more likely
segment data is to be available in memory, which reduces query times.
Conversely, the lower the free operating system memory, the more likely a
Historical is to read segments from disk.
```
This seems like the most important point to me, like it should be part of some
sort of operational best practice. For example, when I'm setting up my system,
I want to know that I should keep as much of the segment data in memory as
possible to make queries fast.
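If this does turn into an operational best practice, a rough sizing sketch could anchor it. The numbers below are purely illustrative assumptions:
```properties
# Example: a host with 256 GiB of RAM where the Historical JVM uses about
# 60 GiB (heap plus direct buffers), leaving roughly 190 GiB of free memory
# for the page cache. Keeping druid.server.maxSize near that free memory
# means most of the assigned segment data can stay memory-mapped.
druid.server.maxSize=190000000000
```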
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
+
+Once a Historical process has completed pulling down and processing a segment
from Deep Storage, the segment is advertised as being available for queries.
This announcement by the Historical is made via Zookeeper, this time under a
[served segments path](../configuration/index.html#path-configuration). At this
point, the segment is considered available for querying by the Broker.
+
+For more information about how the Broker determines what data is available
for queries, please see [Broker](broker.html).
+
+On startup, a Historical process searches through its segment cache and, in
order for Historicals to be queried as soon as possible, immediately advertises
all segments it finds there.
### Loading and serving segments from cache
-Recall that when a Historical process notices a new segment entry in its load
queue path, the Historical process first checks a configurable cache directory
on its local disk to see if the segment had been previously downloaded. If a
local cache entry already exists, the Historical process will directly read the
segment binary files from disk and load the segment.
+A technique called [memory mapping](https://en.wikipedia.org/wiki/Mmap) is
used for the segment cache, consuming memory from the underlying operating
system so that parts of segment files can be held in memory, increasing query
performance at the data level. The in-memory segment cache is therefore
affected by, for example, the size of the Historical JVM, heap / direct memory
buffers, and other processes on the operating system itself.
Review comment:
```suggestion
The segment cache uses [memory mapping](https://en.wikipedia.org/wiki/Mmap).
The cache consumes memory from the underlying operating system so Historicals
can hold parts of segment files in memory to increase query performance at the
data level. The in-memory segment cache is affected by the size of the
Historical JVM, heap / direct memory buffers, and other processes on the
operating system itself.
```
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
Review comment:
```suggestion
See the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size) for more
information.
```
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
+
+> **Use per-segment caching with real-time data**
+>
+> For example, you could have queries that address fresh data arriving from
Kafka, and which additionally covers intervals in segments that are loaded on
historicals. Per-segment caching allows Druid to cache results from the
historical segments confidently, and to merge them with real-time results from
the stream. [Whole-query caching](#whole-query-caching), on the other hand,
would not be helpful in this scenario because new data from real-time ingestion
will continually invalidate the _entire_ result.
### Whole-query caching
-If real-time ingestion invalidating the cache is not an issue for your
queries, you can use **whole-query caching** on the Broker to increase query
efficiency. The Broker performs whole-query caching operations before sending
fan out queries to Historicals. Therefore Druid no longer needs to merge the
per-segment results on the Broker.
+Here, entire results of individual queries are cached, meaning Druid no longer
needs to merge the per-segment results on the Broker.
+
+Whole-query result caching is controlled by the parameters
`useResultLevelCache` and `populateResultLevelCache` and runtime properties
`druid.broker.cache.*`.
Review comment:
This should be covered in the how-to, "Using query caching".
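+1 on moving the parameter details into the how-to. For reference while reviewing, a minimal Broker-side sketch with illustrative values:
```properties
# Broker-side whole-query (result-level) cache
druid.broker.cache.useResultLevelCache=true
druid.broker.cache.populateResultLevelCache=true
```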
##########
File path: docs/operations/basic-cluster-tuning.md
##########
@@ -90,11 +90,9 @@ Tuning the cluster so that each Historical can accept 50
queries and 10 non-quer
#### Segment Cache Size
-`druid.segmentCache.locations` specifies locations where segment data can be
stored on the Historical. The sum of available disk space across these
locations is set as the default value for property: `druid.server.maxSize`,
which controls the total size of segment data that can be assigned by the
Coordinator to a Historical.
+Avoid allocating a Historical an excessive amount of segment data. As the
value of (`free system memory` / total size of all
`druid.segmentCache.locations`) increases, a greater proportion of segments can
be kept in the [memory-mapped segment
cache](../design/historical.md#loading-and-serving-segments-from-cache),
allowing for better query performance.
-Segments are memory-mapped by Historical processes using any available free
system memory (i.e., memory not used by the Historical JVM and heap/direct
memory buffers or other processes on the system). Segments that are not
currently in memory will be paged from disk when queried.
-
-Therefore, the size of cache locations set within
`druid.segmentCache.locations` should be such that a Historical is not
allocated an excessive amount of segment data. As the value of (`free system
memory` / total size of all `druid.segmentCache.locations`) increases, a
greater proportion of segments can be kept in memory, allowing for better query
performance. The total segment data size assigned to a Historical can be
overridden with `druid.server.maxSize`, but this is not required for most of
the use cases.
+The total segment data size assigned to a Historical is calculated using
`druid.segmentCache.locations` but can be overridden with
`druid.server.maxSize`. This is not required for most of the use cases.
Review comment:
```suggestion
Druid uses `druid.segmentCache.locations` to calculate the total segment
data size assigned to a Historical. For some rarer use cases, you can override
this behavior with the `druid.server.maxSize` property.
```
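A small sketch of the override case could help too, assuming made-up sizes:
```properties
# Two cache locations holding up to 500 GB of segment files in total...
druid.segmentCache.locations=[{"path":"/mnt/disk1/segment-cache","maxSize":250000000000},{"path":"/mnt/disk2/segment-cache","maxSize":250000000000}]
# ...while capping what the Coordinator assigns to this Historical at 400 GB
druid.server.maxSize=400000000000
```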
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
Review comment:
```suggestion
Druid may merge per-segment cached results with the results of later queries
that use a similar basic shape with similar filters, aggregations, and so on.
For example, the query may be identical except that it covers a different time
period.
```
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
+
+> **Use per-segment caching with real-time data**
+>
+> For example, you could have queries that address fresh data arriving from
Kafka, and which additionally covers intervals in segments that are loaded on
historicals. Per-segment caching allows Druid to cache results from the
historical segments confidently, and to merge them with real-time results from
the stream. [Whole-query caching](#whole-query-caching), on the other hand,
would not be helpful in this scenario because new data from real-time ingestion
will continually invalidate the _entire_ result.
Review comment:
```suggestion
Use per-segment caching with real-time data. For example, your queries
request data actively arriving from Kafka alongside intervals in segments that
are loaded on Historicals. Druid can merge cached results from Historical
segments with real-time results from the stream. [Whole-query
caching](#whole-query-caching), on the other hand, is not helpful in this
scenario because new data from real-time ingestion will continually invalidate
the entire cached result.
```
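To make this concrete, a minimal sketch of the Historical-side properties the
scenario relies on; the property names are those documented under Historical
caching in the configuration reference, and per-query overrides go through the
`useCache` and `populateCache` query context parameters:

```properties
# Historical runtime.properties: cache per-segment results for immutable segments
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
```

Real-time segments are still computed at query time; only results from the
segments loaded on Historicals are served from this cache.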
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
+
+For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.html).
Review comment:
```suggestion
For more information about segment metadata and Druid segments in general,
see [Segments](../design/segments.html).
```
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
Review comment:
```suggestion
- [Per-segment caching](#per-segment-caching) stores partial query results
for a specific segment. It is enabled by default.
```
avoid parens
##########
File path: docs/querying/caching.md
##########
@@ -69,9 +89,11 @@ For instance, whole-query caching is a good option when you
have queries that in
- On Brokers for small production clusters with less than five servers.
- Do not use per-segment caches on the Broker for large production
clusters. When `druid.broker.cache.populateCache` is `true` and query context
parameter `populateCache` _is not_ `false`, Historicals return results on a
per-segment basis without merging results locally thus negatively impacting
cluster scalability.
+> **Avoid using per-segment cache at the Broker for large production clusters**
+>
+> When the Broker cache is enabled (`druid.broker.cache.populateCache` is
`true`) and `populateCache` _is not_ `false` in the [query
context](../querying/query-context.html), individual Historicals will _not_
merge individual segment-level results, and instead pass these back to the lead
Broker. The Broker must then carry out a large merge from _all_ segments on
its own.
Review comment:
```suggestion
Avoid using per-segment cache at the Broker for large production clusters.
When the Broker cache is enabled (`druid.broker.cache.populateCache` is `true`)
and `populateCache` _is not_ `false` in the [query
context](../querying/query-context.html), individual Historicals will _not_
merge individual segment-level results, and instead pass these back to the lead
Broker. The Broker must then carry out a large merge from _all_ segments on
its own.
```
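For readers, a hedged sketch of what that advice looks like in runtime
properties, keeping segment-level merging on the Historicals; property names
are those from the Broker and Historical caching sections of the configuration
reference:

```properties
# Broker runtime.properties: leave the per-segment cache off for large clusters
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false

# Historical runtime.properties: cache and merge per-segment results locally
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
```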
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
+
+> **Use per-segment caching with real-time data**
+>
+> For example, you could have queries that address fresh data arriving from
Kafka, and which additionally covers intervals in segments that are loaded on
historicals. Per-segment caching allows Druid to cache results from the
historical segments confidently, and to merge them with real-time results from
the stream. [Whole-query caching](#whole-query-caching), on the other hand,
would not be helpful in this scenario because new data from real-time ingestion
will continually invalidate the _entire_ result.
### Whole-query caching
-If real-time ingestion invalidating the cache is not an issue for your
queries, you can use **whole-query caching** on the Broker to increase query
efficiency. The Broker performs whole-query caching operations before sending
fan out queries to Historicals. Therefore Druid no longer needs to merge the
per-segment results on the Broker.
+Here, entire results of individual queries are cached, meaning Druid no longer
needs to merge the per-segment results on the Broker.
Review comment:
```suggestion
With *whole-query caching*, Druid caches the entire results of individual
queries. In this case the Broker no longer needs to merge the per-segment
results.
```
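It might also help to show where this is switched on. A minimal sketch of the
Broker-side properties, assuming the names documented under Broker caching in
the configuration reference; per-query overrides go through the
`useResultLevelCache` and `populateResultLevelCache` query context parameters:

```properties
# Broker runtime.properties: cache and reuse whole-query (result-level) results
druid.broker.cache.useResultLevelCache=true
druid.broker.cache.populateResultLevelCache=true
```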
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
Review comment:
```suggestion
Each Historical process copies or "pulls" segment files from Deep Storage to
local disk in an area called the *segment cache*. Set the
`druid.segmentCache.locations` property to configure the size and location of the
segment cache on each Historical process. See [Historical general
configuration](../configuration/index.html#historical-general-configuration).
```
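A minimal sketch of the shape of that setting may help too; the path and size
below are purely illustrative, and the value is a JSON array with one entry
per local cache location:

```properties
# Historical runtime.properties: one 300 GB segment cache location (example values)
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]
```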
##########
File path: docs/querying/caching.md
##########
@@ -32,30 +32,50 @@ If you're unfamiliar with Druid architecture, review the
following topics before
For instructions to configure query caching see [Using query
caching](./using-caching.md).
+Cache monitoring, including the hit rate and number of evictions, is available
in [Druid metrics](../operations/metrics.html#cache).
+
+Query-level caching is in addition to [data-level
caching](../design/historical.md) on Historicals.
+
## Cache types
-Druid supports the following types of caches:
+Druid supports two types of query caching:
-- **Per-segment** caching which stores _partial results_ of a query for a
specific segment. Per-segment caching is enabled on Historicals by default.
-- **Whole-query** caching which stores all results for a query.
+- [Per-segment caching](#per-segment-caching) stores _partial_ query results
for a specific segment (enabled by default).
+- [Whole-query caching](#whole-query-caching) stores _final_ query results.
-To avoid returning stale results, Druid invalidates the cache the moment any
underlying data changes for both types of cache.
+> **Druid invalidates _any_ cache the moment any underlying data changes**
+>
+> This ensures that Druid does not return stale results, especially important
for `table` datasources that have highly-variable underlying data segments,
including real-time data segments.
-Druid can store cache data on the local JVM heap or in an external distributed
key/value store. The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). Maximum cache storage
defaults to the minimum value of 1 GiB or the ten percent of the maximum
runtime memory for the JVM with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage.
+> **Druid can store cache data on the local JVM heap or in an external
distributed key/value store (e.g. memcached)**
+>
+> The default is a local cache based upon
[Caffeine](https://github.com/ben-manes/caffeine). The default maximum cache
storage size is the minimum of 1 GiB / ten percent of maximum runtime memory
for the JVM, with no cache expiration. See [Cache
configuration](../configuration/index.md#cache-configuration) for information
on how to configure cache storage. When using caffeine, the cache is inside
the JVM heap and is directly measurable. Heap usage will grow up to the
maximum configured size, and then the least recently used segment results will
be evicted and replaced with newer results.
### Per-segment caching
-The primary form of caching in Druid is the **per-segment cache** which stores
query results on a per-segment basis. It is enabled on Historical services by
default.
+The primary form of caching in Druid is a *per-segment results cache*. This
stores partial query results on a per-segment basis and is enabled on
Historical services by default.
+
+It allows Druid to maintain a low-eviction-rate cache for segments that do not
change, especially important for those segments that
[historical](../design/historical.html) processes pull into their local
_segment cache_ from [deep storage](../dependencies/deep-storage.html) as
instructed by the lead [coordinator](../design/coordinator.html). Meanwhile,
real-time segments, on the other hand, continue to have results computed at
query time.
-When your queries include data from segments that are mutable and undergoing
real-time ingestion, use a segment cache. In this case Druid caches query
results for immutable historical segments when possible. It re-computes results
for the real-time segments at query time.
+Per-segment cached results also have the potential to be merged into the
results of later queries where there is a similar basic shape (filters,
aggregations, etc.) yet cover a different period of time, for example.
-For example, you have queries that frequently include incoming data from a
Kafka or Kinesis stream alongside unchanging segments. Per-segment caching lets
Druid cache results from older immutable segments and merge them with updated
data. Whole-query caching would not be helpful in this scenario because the new
data from real-time ingestion continually invalidates the cache.
+Per-segment caching is controlled by the parameters `useCache` and
`populateCache`.
+
+> **Use per-segment caching with real-time data**
+>
+> For example, you could have queries that address fresh data arriving from
Kafka, and which additionally covers intervals in segments that are loaded on
historicals. Per-segment caching allows Druid to cache results from the
historical segments confidently, and to merge them with real-time results from
the stream. [Whole-query caching](#whole-query-caching), on the other hand,
would not be helpful in this scenario because new data from real-time ingestion
will continually invalidate the _entire_ result.
### Whole-query caching
-If real-time ingestion invalidating the cache is not an issue for your
queries, you can use **whole-query caching** on the Broker to increase query
efficiency. The Broker performs whole-query caching operations before sending
fan out queries to Historicals. Therefore Druid no longer needs to merge the
per-segment results on the Broker.
+Here, entire results of individual queries are cached, meaning Druid no longer
needs to merge the per-segment results on the Broker.
+
+Whole-query result caching is controlled by the parameters
`useResultLevelCache` and `populateResultLevelCache` and runtime properties
`druid.broker.cache.*`.
-For instance, whole-query caching is a good option when you have queries that
include data from a batch ingestion task that runs every few hours or once a
day. Per-segment caching would be less efficient in this case because it
requires Druid to merge the per-segment results for each query, even when the
results are cached.
+> **Use whole-query caching when segments are stable and you are not using
real-time ingestion**
+>
+> If real-time ingestion invalidating the cache is not an issue for your
queries, you can use *whole-query caching* on the Broker to increase query
efficiency. The Broker performs whole-query caching operations before sending
fan out queries to Historicals.
+>
+> For instance, whole-query caching is a good option when you have queries
that include data from a batch ingestion task that runs every few hours or once
a day. [Per-segment caching](#per-segment-caching) would be less efficient in
this case because it requires Druid to merge the per-segment results for each
query, even when the results are cached.
Review comment:
```suggestion
Use whole-query caching when segments are stable and you are not using
real-time ingestion. If real-time ingestion invalidating the cache is not an
issue for your queries, you can use *whole-query caching* on the Broker to
increase query efficiency.

For instance, whole-query caching is a good option when you have queries
that include data from a batch ingestion task that runs every few hours or once
a day. [Per-segment caching](#per-segment-caching) would be less efficient in
this case because it requires Druid to merge the per-segment results for each
query, even when the results are cached.
```
Why do I care that the Broker performs the caching before the fan out? How
does that modify my behavior?
##########
File path: docs/design/historical.md
##########
@@ -39,17 +39,31 @@ org.apache.druid.cli.Main server historical
### Loading and serving segments
-Each Historical process maintains a constant connection to Zookeeper and
watches a configurable set of Zookeeper paths for new segment information.
Historical processes do not communicate directly with each other or with the
Coordinator processes but instead rely on Zookeeper for coordination.
+Each Historical process copies ("pulls") segment files from [Deep
Storage](../dependencies/deep-storage.md) to local disk in an area called the
*segment cache*. The size and location of the segment cache on each Historical
process is set using `druid.segmentCache.locations` in
[configuration](../configuration/index.html#historical-general-configuration).
-The [Coordinator](../design/coordinator.md) process is responsible for
assigning new segments to Historical processes. Assignment is done by creating
an ephemeral Zookeeper entry under a load queue path associated with a
Historical process. For more information on how the Coordinator assigns
segments to Historical processes, please see
[Coordinator](../design/coordinator.md).
+For more information on tuning this value, see the [Tuning
Guide](../operations/basic-cluster-tuning.html#segment-cache-size).
-When a Historical process notices a new load queue entry in its load queue
path, it will first check a local disk directory (cache) for the information
about segment. If no information about the segment exists in the cache, the
Historical process will download metadata about the new segment to serve from
Zookeeper. This metadata includes specifications about where the segment is
located in deep storage and about how to decompress and process the segment.
For more information about segment metadata and Druid segments in general,
please see [Segments](../design/segments.md). Once a Historical process
completes processing a segment, the segment is announced in Zookeeper under a
served segments path associated with the process. At this point, the segment is
available for querying.
+The [Coordinator](../design/coordinator.html) leads the assignment of segments
to - and balance between - Historical processes.
[Zookeeper](../dependencies/zookeeper.md) is central to this collaboration;
Historical processes do not communicate directly with each other, nor do they
communicate directly with the Coordinator. Instead, the Coordinator creates
ephemeral Zookeeper entries under a [load queue
path](../configuration/index.html#path-configuration) and each Historical
process maintains a connection to Zookeeper, watching those paths for segment
information.
+
+For more information about how the Coordinator assigns segments to Historical
processes, please see [Coordinator](../design/coordinator.html).
+
+When a Historical process sees a new Zookeeper load queue entry, it checks its
own segment cache. If no information about the segment exists there, the
Historical process first retrieves metadata from Zookeeper about the segment,
including where the segment is located in Deep Storage and how it needs to
decompress and process it.
Review comment:
```suggestion
When a Historical process detects a new entry in the Zookeeper load queue,
it checks its own segment cache. If no information about the segment exists
there, the Historical process first retrieves metadata from Zookeeper about the
segment, including where the segment is located in Deep Storage and how it
needs to decompress and process it.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]