Clarify PXF segment control

Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/1f6714a3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/1f6714a3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/1f6714a3

Branch: refs/heads/develop
Commit: 1f6714a318255596fb6dbd15d3a49866d753294b
Parents: 42fa1bc
Author: Jane Beckman <[email protected]>
Authored: Thu Aug 18 16:55:24 2016 -0700
Committer: David Yozie <[email protected]>
Committed: Fri Aug 19 10:48:04 2016 -0700

----------------------------------------------------------------------
 bestpractices/general_bestpractices.html.md.erb | 1 +
 ddl/ddl-table.html.md.erb                       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/1f6714a3/bestpractices/general_bestpractices.html.md.erb
----------------------------------------------------------------------
diff --git a/bestpractices/general_bestpractices.html.md.erb b/bestpractices/general_bestpractices.html.md.erb
index 3d991cd..6c663c3 100644
--- a/bestpractices/general_bestpractices.html.md.erb
+++ b/bestpractices/general_bestpractices.html.md.erb
@@ -17,6 +17,7 @@ When using HAWQ, adhere to the following guidelines for best 
results:
     -   **Available resources**. Resources available at query time. If more 
resources are available in the resource queue, the resources will be used.
     -   **Hash table and bucket number**. If the query involves only 
hash-distributed tables, and the bucket number (bucketnum) configured for all 
the hash tables is either the same bucket number for all tables or the table 
size for random tables is no more than 1.5 times larger than the size of hash 
tables for the hash tables, then the query's parallelism is fixed (equal to the 
hash table bucket number). Otherwise, the number of virtual segments depends on 
the query's cost and hash-distributed table queries will behave like queries on 
randomly distributed tables.
     -   **Query Type**: For queries with some user-defined functions or for 
external tables where calculating resource costs is difficult , then the number 
of virtual segments is controlled by `hawq_rm_nvseg_perquery_limit `and 
`hawq_rm_nvseg_perquery_perseg_limit` parameters, as well as by the ON clause 
and the location list of external tables. If the query has a hash result table 
(e.g. `INSERT into hash_table`) then the number of virtual segment number must 
be equal to the bucket number of the resulting hash table, If the query is 
performed in utility mode, such as for `COPY` and `ANALYZE` operations, the 
virtual segment number is calculated by different policies, which will be 
explained later in this section.
+    -   **PXF**: PXF external tables use the 
`default_hash_table_bucket_number` parameter, not the 
`hawq_rm_nvseg_perquery_perseg_limit` parameter, to control the number of 
virtual segments. 
 
     See [Query Performance](../query/query-performance.html#topic38) for more 
details.
 

http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/1f6714a3/ddl/ddl-table.html.md.erb
----------------------------------------------------------------------
diff --git a/ddl/ddl-table.html.md.erb b/ddl/ddl-table.html.md.erb
index a29409c..7120031 100644
--- a/ddl/ddl-table.html.md.erb
+++ b/ddl/ddl-table.html.md.erb
@@ -68,7 +68,7 @@ All HAWQ tables are distributed. The default is `DISTRIBUTED 
RANDOMLY` \(round-r
 
 Randomly distributed tables have benefits over hash distributed tables. For 
example, after expansion, HAWQ's elasticity feature lets it automatically use 
more resources without needing to redistribute the data. For extremely large 
tables, redistribution is very expensive. Also, data locality for randomly 
distributed tables is better, especially after the underlying HDFS 
redistributes its data during rebalancing or because of data node failures. 
This is quite common when the cluster is large.
 
-However, hash distributed tables can be faster than randomly distributed 
tables. For example, for TPCH queries, where there are several queries, HASH 
distributed tables can have performance benefits. Choose a distribution policy 
that best suits your application scenario. When you `CREATE TABLE`, you can 
also specify the `bucketnum` option. The `bucketnum` determines the number of 
hash buckets used in creating a hash-distributed table or for pxf external 
table intermediate processing. The number of buckets also affects how many 
virtual segments will be created when processing this data. The bucketnumber of 
a gpfdist external table is the number of gpfdist location, and the 
bucketnumber of a command external table is `ON #num`.
+However, hash distributed tables can be faster than randomly distributed 
tables. For example, for TPCH queries, where there are several queries, HASH 
distributed tables can have performance benefits. Choose a distribution policy 
that best suits your application scenario. When you `CREATE TABLE`, you can 
also specify the `bucketnum` option. The `bucketnum` determines the number of 
hash buckets used in creating a hash-distributed table or for PXF external 
table intermediate processing. The number of buckets also affects how many 
virtual segments will be created when processing this data. The bucketnumber of 
a gpfdist external table is the number of gpfdist location, and the 
bucketnumber of a command external table is `ON #num`. PXF external tables use 
the `default_hash_table_bucket_number` parameter to control virtual segments. 
 
 HAWQ's elastic execution runtime is based on virtual segments, which are 
allocated on demand, based on the cost of the query. Each node uses one 
physical segment and a number of dynamically allocated virtual segments 
distributed to different hosts, thus simplifying performance tuning. Large 
queries use large numbers of virtual segments, while smaller queries use fewer 
virtual segments. Tables do not need to be redistributed when nodes are added 
or removed.
 

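For reference, the behavior documented by this change can be sketched in SQL. This is an illustrative sketch and not part of the commit: the table names, columns, PXF host, port, path, and profile are hypothetical, the numeric values are arbitrary, and it assumes `default_hash_table_bucket_number` can be set at session level in your HAWQ version.

```sql
-- Hash-distributed table: bucketnum sets the number of hash buckets,
-- which also affects how many virtual segments process the data.
CREATE TABLE sales_hash (id int, amount float8)
WITH (bucketnum = 8)
DISTRIBUTED BY (id);

-- PXF external table (hypothetical host, port, path, and profile).
CREATE EXTERNAL TABLE sales_pxf (id int, amount float8)
LOCATION ('pxf://namenode:51200/data/sales?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Per the change above, queries on PXF external tables take their
-- virtual segment count from default_hash_table_bucket_number,
-- not from hawq_rm_nvseg_perquery_perseg_limit.
SET default_hash_table_bucket_number = 16;
SELECT count(*) FROM sales_pxf;
```

By contrast, the diff states that a gpfdist external table takes its bucket number from the number of gpfdist locations, and a command external table takes it from `ON #num`.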