[
https://issues.apache.org/jira/browse/HAWQ-273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069361#comment-15069361
]
Ming LI commented on HAWQ-273:
------------------------------
--- The root cause is:
after thread B is resumed by thread A : COMMIT; @pg_analyze_and_rewrite(), we
still use the old snapshot to call pg_plan_queries(). however in hawq2.0,
because of implementation of data locality and hdfs metadata cache, we fetch
eof of pg_aoseg.pg_aoseg_XXXX(new pg_aoseg table) in this function, so we got
the expired data. while in hawq1.3, it doesn't need this kind of info at
planner phrase.
--- The solution:
So we need to re-fetch the neweast snapshot at before planner.
--- details explanation:
in hawq2.0 we don't fetch the newest snapshot at below stack, so we get the
wrong tuple version of pg_aoseg.pg_aoseg_XXXX(new pg_aoseg table)
Because the ALTER with REORGANIZE statement will firstly insert the tuple with
eof=0 into pg_aoseg.pg_aoseg_XXXX, and then update it with real eof.
At the first tuple, the Txin=2, Tmax is not visible, so the first tuple
returned.
0 0x005c979c in HeapTupleSatisfiesSnapshot at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/utils/time/tqual.c:1293
1 0x00037032 in heapgetpage at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/access/heap/heapam.c:262
2 0x00038bc5 in heapgettup_pagemode at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/access/heap/heapam.c:656
3 0x0003b182 in heap_getnext at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/access/heap/heapam.c:1544
4 0x000509a6 in systable_getnext_or_prev at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/access/index/genam.c:345
5 0x000508d5 in systable_getnext at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/access/index/genam.c:311
6 0x00719595 in AOGetSegFileDataLocation at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/cdb/cdbdatalocality.c:1247
7 0x00717277 in get_block_locations_and_claculte_table_size at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/cdb/cdbdatalocality.c:850
8 0x007233db in calculate_planner_segment_num at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/cdb/cdbdatalocality.c:3975
9 0x003b2626 in resource_negotiator at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/optimizer/plan/planner.c:479
10 0x003b1e0b in planner at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/optimizer/plan/planner.c:292
11 0x00464c43 in pg_plan_query at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/tcop/postgres.c:815
12 0x00464d6e in pg_plan_queries at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/tcop/postgres.c:888
13 0x004664ae in exec_simple_query at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/tcop/postgres.c:1648
14 0x0046c0bc in PostgresMain at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/tcop/postgres.c:4692
15 0x0040cdb9 in BackendRun at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/postmaster/postmaster.c:5834
16 0x0040c153 in BackendStartup at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/postmaster/postmaster.c:5427
17 0x004058e5 in ServerLoop at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/postmaster/postmaster.c:2129
18 0x00404672 in PostmasterMain at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/postmaster/postmaster.c:1421
19 0x0031bcef in main at
/Users/gpadmin/workspace/hawq2/apache-hawq/src/backend/main/main.c:226
while in hawq1.3 we fetch the newest snapshot at below stack
0 GetTransactionSnapshot () at tqual.c:2432
1 0x000000000092bb91 in PortalStart (portal=0x25151f8, params=0x0,
snapshot=0x0, seqServerHost=0x0,
seqServerPort=-1) at pquery.c:755
2 0x0000000000922faf in exec_simple_query (
query_string=0x24e1788 "select count(*) from tbl_isolation;",
seqServerHost=0x0, seqServerPort=-1)
at postgres.c:1739
3 0x0000000000927f1b in PostgresMain (argc=4, argv=0x23c1500,
username=0x23c12f8 "gpadmin")
at postgres.c:4786
4 0x00000000008c389b in BackendRun (port=0x23892a0) at postmaster.c:6811
5 0x00000000008c2d30 in BackendStartup (port=0x23892a0) at postmaster.c:6408
6 0x00000000008bb5bd in ServerLoop () at postmaster.c:2350
7 0x00000000008ba213 in PostmasterMain (argc=18, argv=0x2390390) at
postmaster.c:1556
8 0x00000000007d37f2 in main (argc=18, argv=0x2390390) at main.c:217
> Concurrent read committed SELECT return 0 rows for AO table which is ALTERed
> with REORGANIZE by other transaction
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HAWQ-273
> URL: https://issues.apache.org/jira/browse/HAWQ-273
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Transaction
> Reporter: Ming LI
> Assignee: Ming LI
>
> testdb=# DROP TABLE tbl_isolation;
> DROP TABLE
> testdb=# CREATE TABLE tbl_isolation (a INT, b int, c int) WITH
> (appendonly=true);
> CREATE TABLE
> testdb=# INSERT INTO tbl_isolation SELECT generate_series(1, 100000),
> generate_series(1, 100000), generate_series(1, 100000);
> INSERT 0 100000
> •
> Thread A:
> testdb=# BEGIN transaction isolation level SERIALIZABLE;
> BEGIN
> testdb=# ALTER TABLE tbl_isolation set with ( reorganize='true') distributed
> randomly;
> ALTER TABLE
> •
> Thread B:
> testdb=# BEGIN transaction isolation level read committed;
> BEGIN
> testdb=# select count(*) from tbl_isolation;
> •
> Thread A:
> testdb=# commit;
> COMMIT
> •
> •
> Thread B:
> testdb=# select count(*) from tbl_isolation;
> count
> -------
> 0
> (1 row)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)