[ https://issues.apache.org/jira/browse/HAWQ-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711332#comment-15711332 ]
Paul Guo commented on HAWQ-1183:
--------------------------------

16559 and 16561 are the OIDs of the two external tables, one hash-distributed and the other randomly distributed.

postgres=# select * from gp_distribution_policy;
 localoid | bucketnum | attrnums
----------+-----------+----------
    16554 |         6 | {1}
    16559 |         1 | {1}
    16561 |         1 |
(3 rows)

Looking into DefineExternalRelation(), it turns out that for EXTTBL_TYPE_LOCATION it sets the bucket number to the number of locations (gpfdist locations, in our case):

    createStmt->policy->bucketnum = locLength;

I talked with the original designer, and this seems to be a hack. In theory we should store the location count and the bucket number in different places in the catalog tables. In the short term, we could fix this with the patch below:

@@ -970,7 +970,7 @@ DefineExternalRelation(CreateExternalStmt *createExtStmt)
                                 isweb, iswritable, &isCustom);
     if (!isCustom)
     {
         int locLength = list_length(exttypeDesc->location_list);
-        if (createStmt->policy && locLength > 0)
+        if (createStmt->policy && locLength > 0 && locLength > createStmt->policy->bucketnum)
         {
             createStmt->policy->bucketnum = locLength;
         }

In the long run, we should store the bucket number and the location count in different places.

> Writable external table with Hash distribution shows slow performance
> ---------------------------------------------------------------------
>
>                 Key: HAWQ-1183
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1183
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Paul Guo
>            Assignee: Paul Guo
>
> Steps:
> 1. Create tables and populate them.
> drop table tbl1;
> drop external table ext_tbl1;
> drop external table ext_tbl1_random;
> CREATE TABLE tbl1 (a int, b text) DISTRIBUTED BY (a);
> INSERT INTO tbl1 VALUES (generate_series(1,1000),'aaa');
> INSERT INTO tbl1 VALUES (generate_series(1,10000),'bbb');
> INSERT INTO tbl1 VALUES (generate_series(1,100000),'bbc');
> INSERT INTO tbl1 VALUES (generate_series(1,1000000),'bdbc');
> INSERT INTO tbl1 VALUES (generate_series(1,1000000),'bdddbc');
> CREATE WRITABLE EXTERNAL TABLE ext_tbl1
> ( LIKE tbl1 )
> LOCATION ('gpfdist://127.0.0.1/tbl1.csv')
> FORMAT 'CSV' (DELIMITER ',')
> DISTRIBUTED BY (a);
> CREATE WRITABLE EXTERNAL TABLE ext_tbl1_random
> ( LIKE tbl1 )
> LOCATION ('gpfdist://127.0.0.1/tbl1.random.csv')
> FORMAT 'CSV' (DELIMITER ',')
> DISTRIBUTED RANDOMLY;
>
> 2. Write to the two external tables. The external table with hash distribution is slow to insert into, and its plan shows only 1 worker.
>
> postgres=# explain analyze INSERT INTO ext_tbl1 SELECT * from tbl1;
>                                                        QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------
>  Insert  (cost=0.00..509.20 rows=1000 width=8)
>    Rows out:  Avg 2111000.0 rows x 1 workers.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 70/70 ms to first row, 20304/20304 ms to end, start offset by 30/30 ms.
>    Executor memory:  1K bytes.
>    ->  Result  (cost=0.00..431.07 rows=1000 width=20)
>          Rows out:  Avg 2111000.0 rows x 1 workers.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first row, 2034/2034 ms to end, start offset by 30/30 ms.
>          ->  Redistribute Motion 1:1  (slice1; segments: 1)  (cost=0.00..431.05 rows=1000 width=8)
>                Hash Key: tbl1.a
>                Rows out:  Avg 2111000.0 rows x 1 workers at destination.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first row, 1370/1370 ms to end, start offset by 30/30 ms.
>                ->  Table Scan on tbl1  (cost=0.00..431.01 rows=1000 width=8)
>                      Rows out:  Avg 2111000.0 rows x 1 workers.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first row, 566/566 ms to end, start offset by 30/30 ms.
>  Slice statistics:
>    (slice0)    Executor memory: 293K bytes (seg0:host67).
>    (slice1)    Executor memory: 303K bytes (seg0:host67).
>  Statement statistics:
>    Memory used: 262144K bytes
>  Optimizer status: PQO version 1.684
>  Dispatcher statistics:
>    executors used(total/cached/new connection): (2/0/2); dispatcher time(total/connection/dispatch data): (17.095 ms/16.477 ms/0.053 ms).
>    dispatch data time(max/min/avg): (0.027 ms/0.025 ms/0.026 ms); consume executor data time(max/min/avg): (0.051 ms/0.043 ms/0.047 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
>  Data locality statistics:
>    data locality ratio: 1.000; virtual segment number: 1; different host number: 1; virtual segment number per host(avg/min/max): (1/1/1); segment size(avg/min/max): (46023656.000 B/46023656 B/46023656 B); segment size with penalty(avg/min/max): (46023656.000 B/46023656 B/46023656 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 46.181 ms; resource allocation: 1.837 ms; datalocality calculation: 1.180 ms.
>  Total runtime: 20538.524 ms
> (22 rows)
>
> postgres=# explain analyze INSERT INTO ext_tbl1 SELECT * from tbl1;
>                                                        QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------
>  Insert  (cost=0.00..444.03 rows=167 width=8)
>    Rows out:  Avg 351833.3 rows x 6 workers.  Max/Last(seg5:host67/seg0:host67) 351916/351849 rows with 89/171 ms to first row, 4074/4209 ms to end, start offset by 46/45 ms.
>    Executor memory:  1K bytes avg, 1K bytes max (seg5:host67).
>    ->  Result  (cost=0.00..431.01 rows=167 width=20)
>          Rows out:  Avg 351833.3 rows x 6 workers.  Max/Last(seg5:host67/seg0:host67) 351916/351849 rows with 77/148 ms to first row, 292/392 ms to end, start offset by 46/45 ms.
>          ->  Table Scan on tbl1  (cost=0.00..431.00 rows=167 width=8)
>                Rows out:  Avg 351833.3 rows x 6 workers.  Max/Last(seg5:host67/seg2:host67) 351916/351855 rows with 77/152 ms to first row, 158/257 ms to end, start offset by 46/42 ms.
>  Slice statistics:
>    (slice0)    Executor memory: 280K bytes avg x 6 workers, 280K bytes max (seg5:host67).
>  Statement statistics:
>    Memory used: 262144K bytes
>  Optimizer status: PQO version 1.684
>  Dispatcher statistics:
>    executors used(total/cached/new connection): (6/0/6); dispatcher time(total/connection/dispatch data): (38.288 ms/37.708 ms/0.078 ms).
>    dispatch data time(max/min/avg): (0.028 ms/0.004 ms/0.012 ms); consume executor data time(max/min/avg): (0.067 ms/0.014 ms/0.029 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
>  Data locality statistics:
>    data locality ratio: 1.000; virtual segment number: 6; different host number: 1; virtual segment number per host(avg/min/max): (6/6/6); segment size(avg/min/max): (7670609.333 B/7668464 B/7672344 B); segment size with penalty(avg/min/max): (0.000 B/0 B/0 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 28.855 ms; resource allocation: 12.933 ms; datalocality calculation: 0.190 ms.
>  Total runtime: 4333.663 ms
> (18 rows)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)