[incubator-druid] branch 0.14.0-incubating updated: Improve doc for auto compaction (#7117) (#7190)

fjy Mon, 04 Mar 2019 22:18:56 -0800

This is an automated email from the ASF dual-hosted git repository.

fjy pushed a commit to branch 0.14.0-incubating
in repository https://gitbox.apache.org/repos/asf/incubator-druid.git



The following commit(s) were added to refs/heads/0.14.0-incubating by this push:
     new cb4227d  Improve doc for auto compaction (#7117) (#7190)
cb4227d is described below

commit cb4227dc0c84c6e67f997b5f18615e65067d41dd
Author: Jonathan Wei <[email protected]>
AuthorDate: Mon Mar 4 22:18:29 2019 -0800

    Improve doc for auto compaction (#7117) (#7190)
    
    * Improve doc for auto compaction
    
    * fix doc
    
    * address comments
---
 docs/content/configuration/index.md             |  6 +-
 docs/content/design/coordinator.md              |  2 +-
 docs/content/operations/api-reference.md        |  3 +-
 docs/content/operations/segment-optimization.md | 74 +++++++++++++++++++++----
 4 files changed, 71 insertions(+), 14 deletions(-)

diff --git a/docs/content/configuration/index.md b/docs/content/configuration/index.md
index 2486ff4..858c2a2 100644
--- a/docs/content/configuration/index.md
+++ b/docs/content/configuration/index.md
@@ -832,8 +832,10 @@ These configuration options control the behavior of the Lookup dynamic configura
 
 ##### Compaction Dynamic Configuration
 
-Compaction configurations can also be set or updated dynamically without restarting Coordinators. For segment compaction,
-please see [Compacting Segments](../design/coordinator.html#compacting-segments).
+Compaction configurations can also be set or updated dynamically using
+[Coordinator's API](../operations/api-reference.html#compaction-configuration) without restarting Coordinators.
+
+For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.html).
 
 A description of the compaction config is:
 
diff --git a/docs/content/design/coordinator.md b/docs/content/design/coordinator.md
index efb9e9c..9571f3a 100644
--- a/docs/content/design/coordinator.md
+++ b/docs/content/design/coordinator.md
@@ -66,7 +66,7 @@ To ensure an even distribution of segments across Historical processes in the cl
 ### Compacting Segments
 
 Each run, the Druid Coordinator compacts small segments abutting each other. This is useful when you have a lot of small
-segments which may degrade the query performance as well as increasing the disk space usage.
+segments which may degrade query performance as well as increase disk space usage. See [Segment Size Optimization](../operations/segment-optimization.html) for details.
 
 The Coordinator first finds the segments to compact together based on the [segment search policy](#segment-search-policy).
 Once some segments are found, it launches a [compaction task](../ingestion/tasks.html#compaction-task) to compact those segments.
diff --git a/docs/content/operations/api-reference.md b/docs/content/operations/api-reference.md
index 2417ead..5ad0e6c 100644
--- a/docs/content/operations/api-reference.md
+++ b/docs/content/operations/api-reference.md
@@ -341,7 +341,8 @@ will be set for them.
 
 * `/druid/coordinator/v1/config/compaction`
 
-Creates or updates the compaction config for a dataSource. See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details.
+Creates or updates the compaction config for a dataSource.
+See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details.
 
 
 ##### DELETE
diff --git a/docs/content/operations/segment-optimization.md b/docs/content/operations/segment-optimization.md
index d13e4ba..179b418 100644
--- a/docs/content/operations/segment-optimization.md
+++ b/docs/content/operations/segment-optimization.md
@@ -1,6 +1,6 @@
 ---
 layout: doc_page
-title: "Segment size optimization"
+title: "Segment Size Optimization"
 ---
 
 <!--
@@ -22,25 +22,79 @@ title: "Segment size optimization"
   ~ under the License.
   -->
 
-# Segment size optimization
+# Segment Size Optimization
 
 In Druid, it's important to optimize the segment size because
 
  1. Druid stores data in segments. If you're using the [best-effort roll-up](../design/index.html#roll-up-modes) mode,
   increasing the segment size might introduce further aggregation which reduces the dataSource size.
-  2. When a query is submitted, that query is distributed to all Historicals and realtimes
-  which hold the input segments of the query. Each process has a processing threads pool and use one thread per segment to
-  process it. If the segment size is too large, data might not be well distributed over the
-  whole cluster, thereby decreasing the degree of parallelism. If the segment size is too small,
-  each processing thread processes too small data. This might reduce the processing speed of other queries as well as
-  the input query itself because the processing threads are shared for executing all queries.
+  2. When a query is submitted, that query is distributed to all Historicals and realtime tasks
+  which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool
+  to process a single segment. If segment sizes are too large, data might not be well distributed between data
+  servers, decreasing the degree of parallelism possible during query processing.
+  At the other extreme where segment sizes are too small, the scheduling
+  overhead of processing a larger number of segments per query can reduce
+  performance, as the threads that process each segment compete for the fixed
+  slots of the processing pool.
 
 It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy
-especially for the streaming ingestion because the amount of data ingested might vary over time. In this case,
-you can roughly set the segment size at ingestion time and optimize it later. You have two options:
+especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case,
+you can create segments with a sub-optimzed size first and optimize them later.
+
+You may need to consider the followings to optimize your segments.
+
+  - Number of rows per segment: it's generally recommended for each segment to have around 5 million rows.
+  This setting is usually _more_ important than the below "segment byte size".
+  This is because Druid uses a single thread to process each segment,
+  and thus this setting can directly control how many rows each thread processes,
+  which in turn means how well the query execution is parallelized.
+  - Segment byte size: it's recommended to set 300 ~ 700MB. If this value
+  doesn't match with the "number of rows per segment", please consider optimizing
+  number of rows per segment rather than this value.
+
+<div class="note">
+The above recommendation works in general, but the optimal setting can
+vary based on your workload. For example, if most of your queries
+are heavy and take a long time to process each row, you may want to make
+segments smaller so that the query processing can be more parallelized.
+If you still see some performance issue after optimizing segment size,
+you may need to find the optimal settings for your workload.
+</div>
+
+There might be several ways to check if the compaction is necessary. One way
+is using the [System Schema](../querying/sql.html#system-schema). The
+system schema provides several tables about the current system status including the `segments` table.
+By running the below query, you can get the average number of rows and average size for published segments.
+
+```sql
+SELECT
+  "start",
+  "end",
+  version,
+  COUNT(*) AS num_segments,
+  AVG("num_rows") AS avg_num_rows,
+  SUM("num_rows") AS total_num_rows,
+  AVG("size") AS avg_size,
+  SUM("size") AS total_size
+FROM
+  sys.segments A
+WHERE
+  datasource = 'your_dataSource' AND
+  is_published = 1
+GROUP BY 1, 2, 3
+ORDER BY 1, 2, 3 DESC;
+```
+
+Please note that the query result might include overshadowed segments.
+In this case, you may want to see only rows of the max version per interval (pair of `start` and `end`).
+
+Once you find your segments need compaction, you can consider the below two options:
 
   - Turning on the [automatic compaction of Coordinators](../design/coordinator.html#compacting-segments).
   The Coordinator periodically submits [compaction tasks](../ingestion/tasks.html#compaction-task) to re-index small segments.
+  To enable the automatic compaction, you need to configure it for each dataSource via Coordinator's dynamic configuration.
+  See [Compaction Configuration API](../operations/api-reference.html#compaction-configuration)
+  and [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for details.
   - Running periodic Hadoop batch ingestion jobs and using a `dataSource`
   inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel.
   Details on how to do this can be found under ['Updating Existing Data'](../ingestion/update-existing-data.html).
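
Note on the overshadowed-segments caveat in segment-optimization.md above: the patch says you may want only the rows of the max version per interval, but does not show how to get them. One possible way is a small client-side pass over the query result, sketched below. This is not part of the commit. The function name `latest_version_per_interval` and the sample rows are hypothetical, and the sketch assumes that segment versions are ISO-8601 timestamp strings, so that plain string comparison orders them chronologically.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def latest_version_per_interval(rows: Iterable[Row]) -> List[Row]:
    """Keep, for each (start, end) interval, only the row with the greatest version."""
    latest: Dict[Tuple[str, str], Row] = {}
    for row in rows:
        key = (str(row["start"]), str(row["end"]))
        current = latest.get(key)
        # Assumption: versions are ISO-8601 timestamp strings, so comparing
        # them as strings orders them from oldest to newest.
        if current is None or str(row["version"]) > str(current["version"]):
            latest[key] = row
    # Return the surviving rows ordered by interval.
    return [latest[key] for key in sorted(latest)]


# Hypothetical rows shaped like the output of the sys.segments query in the patch.
sample_rows: List[Row] = [
    {"start": "2019-01-01T00:00:00.000Z", "end": "2019-01-02T00:00:00.000Z",
     "version": "2019-01-02T03:00:00.000Z", "num_segments": 12},
    {"start": "2019-01-01T00:00:00.000Z", "end": "2019-01-02T00:00:00.000Z",
     "version": "2019-01-05T09:30:00.000Z", "num_segments": 2},
    {"start": "2019-01-02T00:00:00.000Z", "end": "2019-01-03T00:00:00.000Z",
     "version": "2019-01-03T03:00:00.000Z", "num_segments": 9},
]

for row in latest_version_per_interval(sample_rows):
    print(row["start"], row["end"], row["version"], row["num_segments"])
```

In the sample, the first interval has two versions. Only the newer one (2 segments) is kept, which is the view that matters when deciding whether the interval still needs compaction.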

