[02/31] drill git commit: coordinate with Bridget's draft of Perf Tuning

tshiran Mon, 18 May 2015 16:37:27 -0700

coordinate with Bridget's draft of Perf Tuning


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/64910ec6
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/64910ec6
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/64910ec6

Branch: refs/heads/gh-pages
Commit: 64910ec60b5446533da460be9896fa74cef87cd0
Parents: 6c87ff0
Author: Kristine Hahn <[email protected]>
Authored: Sun May 17 11:23:57 2015 -0700
Committer: Kristine Hahn <[email protected]>
Committed: Sun May 17 11:23:57 2015 -0700

----------------------------------------------------------------------
 ...guring-a-multitenant-cluster-introduction.md |  4 +-
 .../050-configuring-multitenant-resources.md    |  6 ++-
 .../060-configuring-a-shared-drillbit.md        | 32 ++++++++--------
 .../010-configuration-options-introduction.md   |  8 ++--
 .../030-planning-and-exececution-options.md     | 40 +-------------------
 5 files changed, 29 insertions(+), 61 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md
----------------------------------------------------------------------
diff --git 
a/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md 
b/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md
index 80edfc8..c6a272f 100644
--- 
a/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md
+++ 
b/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md
@@ -15,6 +15,6 @@ You need to plan and configure the following resources for 
use with Drill and ot
 
 * [Memory]({{site.baseurl}}/docs/configuring-multitenant-resources)  
 * 
[CPU]({{site.baseurl}}/docs/configuring-multitenant-resources#how-to-manage-drill-cpu-resources)
  
-* Disk  
+* 
[Disk]({{site.baseurl}}/docs/configuring-multitenant-resources#how-to-manage-drill-disk-resources)
 
 
-When users share a Drillbit, [configure 
queues]({{site.baseurl}}/docs/configuring-resources-for-a-shared-drillbit#configuring-query-queuing)
 and 
[parallelization]({{site.baseurl}}/docs/configuring-resources-for-a-shared-drillbit#configuring-parallelization)
 in addition to memory.
\ No newline at end of file
+When users share a Drillbit, [configure 
queues]({{site.baseurl}}/docs/configuring-resources-for-a-shared-drillbit#configuring-query-queuing)
 and 
[parallelization]({{site.baseurl}}/docs/configuring-resources-for-a-shared-drillbit#configuring-parallelization)
 in addition to memory. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/050-configuring-multitenant-resources.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/050-configuring-multitenant-resources.md 
b/_docs/configure-drill/050-configuring-multitenant-resources.md
index fcdab5c..c54f430 100644
--- a/_docs/configure-drill/050-configuring-multitenant-resources.md
+++ b/_docs/configure-drill/050-configuring-multitenant-resources.md
@@ -33,4 +33,8 @@ Configure NodeManager and ResourceManager to reconfigure the 
total memory requir
 Modify MapReduce memory to suit your application needs. Remaining memory is 
typically given to YARN applications. 
 
 ## How to Manage Drill CPU Resources
-Currently, you do not manage CPU resources within Drill. [Use Linux 
`cgroups`](http://en.wikipedia.org/wiki/Cgroups) to manage the CPU resources.
\ No newline at end of file
+Currently, you do not manage CPU resources within Drill. [Use Linux 
`cgroups`](http://en.wikipedia.org/wiki/Cgroups) to manage the CPU resources.
+
+## How to Manage Disk Resources
+
+The `planner.add_producer_consumer` system option enables or disables a 
secondary reading thread that works out of band of the rest of the scanning 
fragment to prefetch data from disk. If you interact with a certain type of 
storage medium that is slow or does not prefetch much data, this option tells 
Drill to add a producer consumer reading thread to the operation. Drill can 
then assign one thread that focuses on a single reading fragment. If Drill is 
using memory, you can disable this option to get better performance. If Drill 
is using disk space, you should enable this option and set a reasonable queue 
size for the `planner.producer_consumer_queue_size` option. For more 
information about these options, see the section,  "Performance Tuning".
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/060-configuring-a-shared-drillbit.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/060-configuring-a-shared-drillbit.md 
b/_docs/configure-drill/060-configuring-a-shared-drillbit.md
index 52e3db4..4e47213 100644
--- a/_docs/configure-drill/060-configuring-a-shared-drillbit.md
+++ b/_docs/configure-drill/060-configuring-a-shared-drillbit.md
@@ -11,20 +11,7 @@ Set [options in 
sys.options]({{site.baseurl}}/docs/configuration-options-introdu
 * exec.queue.large  
 * exec.queue.small  
 
-### Example Configuration
-
-For example, you configure the queue reserved for large queries for a 5-query 
maximum. You configure the queue reserved for small queries for 20 queries. 
Users start to run queries, and Drill receives the following query requests in 
this order:
-
-* Query A (blue): 1 billion records, Drill estimates 10 million rows will be 
processed  
-* Query B (red): 2 billion records, Drill estimates 20 million rows will be 
processed  
-* Query C: 1 billion records  
-* Query D: 100 records
-
-The exec.queue.threshold default is 30 million, which is the estimated rows to 
be processed by the query. Queries A and B are queued in the large queue. The 
estimated rows to be processed reaches the 30 million threshold, filling the 
queue to capacity. The query C request arrives and goes on the wait list, and 
then query D arrives. Query D is queued immediately in the small queue because 
of its small size, as shown in the following diagram: 
-
-![drill queuing]({{ site.baseurl }}/docs/img/queuing.png)
-
-The Drill queuing configuration in this example tends to give many users 
running small queries a rapid response. Users running a large query might 
experience some delay until an earlier-received large query returns, freeing 
space in the large queue to process queries that are waiting.
+For more information, see the section, "Performance Tuning".
 
 ## Configuring Parallelization
 
@@ -39,7 +26,22 @@ To configure parallelization, configure the following 
options in the `sys.option
 * `planner.width.max.per.query`  
   Same as max per node but applies to the query as executed by the entire 
cluster.
 
-Configure the `planner.width.max.per.node` to achieve fine grained, absolute 
control over parallelization. 
+### planner.width.max_per_node
+Configure the `planner.width.max.per.node` to achieve fine grained, absolute 
control over parallelization. In this context *width* refers to fanout or 
distribution potential: the ability to run a query in parallel across the cores 
on a node and the nodes on a cluster. A physical plan consists of intermediate 
operations, known as query &quot;fragments,&quot; that run concurrently, 
yielding opportunities for parallelism above and below each exchange operator 
in the plan. An exchange operator represents a breakpoint in the execution flow 
where processing can be distributed. For example, a single-process scan of a 
file may flow into an exchange operator, followed by a multi-process 
aggregation fragment.
+
+The maximum width per node defines the maximum degree of parallelism for any 
fragment of a query, but the setting applies at the level of a single node in 
the cluster. The *default* maximum degree of parallelism per node is calculated 
as follows, with the theoretical maximum automatically scaled back (and rounded 
down) so that only 70% of the actual available capacity is taken into account: 
number of active drillbits (typically one per node) * number of cores per node 
* 0.7
+
+For example, on a single-node test system with 2 cores and hyper-threading 
enabled: 1 * 4 * 0.7 = 3
+
+When you modify the default setting, you can supply any meaningful number. The 
system does not automatically scale down your setting.
+
+### planner.width.max_per_query
+
+The max_per_query value also sets the maximum degree of parallelism for any 
given stage of a query, but the setting applies to the query as executed by the 
whole cluster (multiple nodes). In effect, the actual maximum width per query 
is the *minimum of two values*: min((number of nodes * width.max_per_node), 
width.max_per_query)
+
+For example, on a 4-node cluster where `width.max_per_node` is set to 6 and 
`width.max_per_query` is set to 30: min((4 * 6), 30) = 24
+
+In this case, the effective maximum width per query is 24, not 30.
 
 <!-- ??For example, setting the `planner.width.max.per.query` to 60 will not 
accelerate Drill operations because overlapping does not occur when executing 
60 queries at the same time.??
 

http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
----------------------------------------------------------------------
diff --git 
a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
 
b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
index ee2ff9e..bdd19f3 100644
--- 
a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
+++ 
b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
@@ -22,8 +22,8 @@ The sys.options table lists the following options that you 
can set as a system o
 | exec.java_compiler                             | DEFAULT          | Switches 
between DEFAULT, JDK, and JANINO mode for the current session. Uses Janino by 
default for generated source code of less than 
exec.java_compiler_janino_maxsize; otherwise, switches to the JDK compiler.     
                                                                                
                                                           |
 | exec.java_compiler_debug                       | TRUE             | Toggles 
the output of debug-level compiler error messages in runtime generated code.    
                                                                                
                                                                                
                                                                                
                         |
 | exec.java_compiler_janino_maxsize              | 262144           | See the 
exec.java_compiler option comment. Accepts inputs of type LONG.                 
                                                                                
                                                                                
                                                                                
                         |
-| exec.max_hash_table_size                       | 1073741824       | Ending 
size for hash tables. Range: 0 - 1073741824. For internal use.                  
                                                                                
                                                                                
                                                                                
                          |
-| exec.min_hash_table_size                       | 65536            | Starting 
size for hash tables. Increase according to available memory to improve 
performance. Range: 0 - 1073741824. For internal use.                           
                                                                                
                                                                                
                                |
+| exec.max_hash_table_size                       | 1073741824       | Ending 
size for hash tables. Range: 0 - 1073741824.                                    
                                                                                
                                                                                
                                                                                
                          |
+| exec.min_hash_table_size                       | 65536            | Starting 
size for hash tables. Increase according to available memory to improve 
performance. Increasing for very large aggregations or joins when you have 
large amounts of memory for Drill to use. Range: 0 - 1073741824.                
                                                                                
                                     |
 | exec.queue.enable                              | FALSE            | Changes 
the state of query queues to control the number of queries that run 
simultaneously.                                                                 
                                                                                
                                                                                
                                     |
 | exec.queue.large                               | 10               | Sets the 
number of large queries that can run concurrently in the cluster. Range: 0-1000 
                                                                                
                                                                                
                                                                                
                        |
 | exec.queue.small                               | 100              | Sets the 
number of small queries that can run concurrently in the cluster. Range: 0-1001 
                                                                                
                                                                                
                                                                                
                        |
@@ -65,13 +65,13 @@ The sys.options table lists the following options that you 
can set as a system o
 | planner.partitioner_sender_max_threads         | 8                | Upper 
limit of threads for outbound queuing.                                          
                                                                                
                                                                                
                                                                                
                           |
 | planner.partitioner_sender_set_threads         | -1               | 
Overwrites the number of threads used to send out batches of records. Set to -1 
to disable. Typically not changed.                                              
                                                                                
                                                                                
                                 |
 | planner.partitioner_sender_threads_factor      | 2                | A 
heuristic param to use to influence final number of threads. The higher the 
value the fewer the number of threads.                                          
                                                                                
                                                                                
                                   |
-| planner.producer_consumer_queue_size           | 10               | How much 
data to prefetch from disk in record batches out-of-band of query execution.    
                                                                                
                                                                                
                                                                                
                        |
+| planner.producer_consumer_queue_size           | 10               | How much 
data to prefetch from disk in record batches out-of-band of query execution. 
The larger the queue size, the greater the amount of memory that the queue and 
overall query execution consumes.                                               
                                                                                
                            |
 | planner.slice_target                           | 100000           | The 
number of records manipulated within a fragment before Drill parallelizes 
operations.                                                                     
                                                                                
                                                                                
                                   |
 | planner.width.max_per_node                     | 3                | Maximum 
number of threads that can run in parallel for a query on a node. A slice is an 
individual thread. This number indicates the maximum number of slices per query 
for the queryâs major fragment on a node.                                     
                                                                                
                           |
 | planner.width.max_per_query                    | 1000             | Same as 
max per node but applies to the query as executed by the entire cluster. For 
example, this value might be the number of active Drillbits, or a higher number 
to return results faster.                                                       
                                                                                
                            |
 | store.format                                   | parquet          | Output 
format for data written to tables with the CREATE TABLE AS (CTAS) command. 
Allowed values are parquet, json, or text. Allowed values: 0, -1, 1000000       
                                                                                
                                                                                
                               |
 | store.json.all_text_mode                       | FALSE            | Drill 
reads all data from the JSON files as VARCHAR. Prevents schema change errors.   
                                                                                
                                                                                
                                                                                
                           |
-| store.json.extended_types                      | FALSE            | Turns on 
special JSON structures that Drill serializes for storing more type information 
than the [four basic JSON types] 
(http://docs.mongodb.org/manual/reference/mongodb-extended-json/).              
                                                                                
                                                                       |
+| store.json.extended_types                      | FALSE            | Turns on 
special JSON structures that Drill serializes for storing more type information 
than the [four basic JSON 
types](http://docs.mongodb.org/manual/reference/mongodb-extended-json/).        
                                                                                
                                                                              |
 | store.json.read_numbers_as_double              | FALSE            | Reads 
numbers with or without a decimal point as DOUBLE. Prevents schema change 
errors.                                                                         
                                                                                
                                                                                
                                 |
 | store.mongo.all_text_mode                      | FALSE            | Similar 
to store.json.all_text_mode for MongoDB.                                        
                                                                                
                                                                                
                                                                                
                         |
 | store.mongo.read_numbers_as_double             | FALSE            | Similar 
to store.json.read_numbers_as_double.                                           
                                                                                
                                                                                
                                                                                
                         |

http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
----------------------------------------------------------------------
diff --git 
a/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
 
b/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
index 2608538..d1c9a30 100644
--- 
a/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
+++ 
b/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
@@ -14,42 +14,4 @@ Use the ALTER SYSTEM or ALTER SESSION commands to set 
options. Typically,
 you set the options at the session level unless you want the setting to
 persist across all sessions.
 
-The summary of system options lists default values. The following descriptions 
provide more detail on some of these options:
-
-### exec.min_hash_table_size
-
-The default starting size for hash tables. Increasing this size is useful for 
very large aggregations or joins when you have large amounts of memory for 
Drill to use. Drill can spend a lot of time resizing the hash table as it finds 
new data. If you have large data sets, you can increase this hash table size to 
increase performance.
-
-### planner.add_producer_consumer
-
-This option enables or disables a secondary reading thread that works out of 
band of the rest of the scanning fragment to prefetch data from disk. If you 
interact with a certain type of storage medium that is slow or does not 
prefetch much data, this option tells Drill to add a producer consumer reading 
thread to the operation. Drill can then assign one thread that focuses on a 
single reading fragment. If Drill is using memory, you can disable this option 
to get better performance. If Drill is using disk space, you should enable this 
option and set a reasonable queue size for the 
planner.producer_consumer_queue_size option.
-
-### planner.broadcast_threshold
-
-Threshold, in terms of a number of rows, that determines whether a broadcast 
join is chosen for a query. Regardless of the setting of the broadcast_join 
option (enabled or disabled), a broadcast join is not chosen unless the right 
side of the join is estimated to contain fewer rows than this threshold. The 
intent of this option is to avoid broadcasting too many rows for join purposes. 
Broadcasting involves sending data across nodes and is a network-intensive 
operation. (The &quot;right side&quot; of the join, which may itself be a join 
or simply a table, is determined by cost-based optimizations and heuristics 
during physical planning.)
-
-### planner.enable_broadcast_join, planner.enable_hashagg, 
planner.enable_hashjoin, planner.enable_mergejoin, 
planner.enable_multiphase_agg, planner.enable_streamagg
-
-These options enable or disable specific aggregation and join operators for 
queries. These operators are all enabled by default and in general should not 
be disabled.</p><p>Hash aggregation and hash join are hash-based operations. 
Streaming aggregation and merge join are sort-based operations. Both hash-based 
and sort-based operations consume memory; however, currently, hash-based 
operations do not spill to disk as needed, but the sort-based operations do. If 
large hash operations do not fit in memory on your system, you may need to 
disable these operations. Queries will continue to run, using alternative plans.
-
-### planner.producer_consumer_queue_size
-
-Determines how much data to prefetch from disk (in record batches) out of band 
of query execution. The larger the queue size, the greater the amount of memory 
that the queue and overall query execution consumes.
-
-### planner.width.max_per_node
-
-In this context *width* refers to fanout or distribution potential: the 
ability to run a query in parallel across the cores on a node and the nodes on 
a cluster. A physical plan consists of intermediate operations, known as query 
&quot;fragments,&quot; that run concurrently, yielding opportunities for 
parallelism above and below each exchange operator in the plan. An exchange 
operator represents a breakpoint in the execution flow where processing can be 
distributed. For example, a single-process scan of a file may flow into an 
exchange operator, followed by a multi-process aggregation fragment.
-
-The maximum width per node defines the maximum degree of parallelism for any 
fragment of a query, but the setting applies at the level of a single node in 
the cluster. The *default* maximum degree of parallelism per node is calculated 
as follows, with the theoretical maximum automatically scaled back (and rounded 
down) so that only 70% of the actual available capacity is taken into account: 
number of active drillbits (typically one per node) * number of cores per node 
* 0.7
-
-For example, on a single-node test system with 2 cores and hyper-threading 
enabled: 1 * 4 * 0.7 = 3
-
-When you modify the default setting, you can supply any meaningful number. The 
system does not automatically scale down your setting.
-
-### planner.width.max_per_query
-
-The max_per_query value also sets the maximum degree of parallelism for any 
given stage of a query, but the setting applies to the query as executed by the 
whole cluster (multiple nodes). In effect, the actual maximum width per query 
is the *minimum of two values*: min((number of nodes * width.max_per_node), 
width.max_per_query)
-
-For example, on a 4-node cluster where `width.max_per_node` is set to 6 and 
`width.max_per_query` is set to 30: min((4 * 6), 30) = 24
-
-In this case, the effective maximum width per query is 24, not 30.
\ No newline at end of file
+The [summary of system 
options]({{site.baseurl}}/docs/configuration-options-introduction) lists 
default values and a short description of the planning and execution options. 
The planning option names have a planning preface. Execution options have an 
exe preface. For more information, see the section, "Performance Turning".The 
following descriptions provide more detail on some of these options:

[02/31] drill git commit: coordinate with Bridget's draft of Perf Tuning

Reply via email to