[ 
https://issues.apache.org/jira/browse/HAWQ-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044753#comment-15044753
 ] 

Ming LI edited comment on HAWQ-232 at 12/7/15 11:01 AM:
--------------------------------------------------------

The hot spot is in the startup 3 phrase. 
Most of the hang problem is occurs when a long load command which running 
against a big partition table were killed, this load command will append some 
the data to some existing files, and also it may create some new files for some 
new virtual segments to execute INSERT command( because all existing files have 
been used for other statements or for other virtual segments). And if this 
INSERT statement  crashed, then it will take a long time to redo the 
create/drop hdfs files.

And I trace more info, in fact all persistent table state and relation dir and 
relfile on hdfs are keep in consistent status at recovery PASS2.

---------------------------- log of 16385/16387/472145/5 at recovery PASS2 
-----------------------
144449 2015-11-25 16:07:35.083979 
CST,,,p32839,th-1553125376,,,,0,,,seg-10000,,,x1013,,"PANIC","XX000","Found 
persistent file-system object in unexpected stat       e 'Create Pending' 
(cdbpersistentfilesysobj.c:1280)",,,,,"State change is for persistent 'Relation 
File: '16385/16387/23281/2'' to go from states 'Drop        Pending' or 
'Aborting Create' to state 'Free' with serial number 19592 at TID (43,70)
144450 Dropping file-system object -- Relation File: 
'16385/16387/23281/2'",,0,,"cdbpersistentfilesysobj.c",1280,"Stack trace:
144451 1    0x565925 postgres errstart + 0x3c4
144452 2    0x6ecc84 postgres PersistentFileSysObj_VerifyExpected + 0x323
144453 3    0x6ed0ac postgres PersistentFileSysObj_StateChange + 0x35a
144454 4    0x6db667 postgres PersistentRelfile_Dropped + 0x44e
144455 5    0x6ee1f6 postgres PersistentFileSysObj_DropObject + 0x56c
144456 6    0x6f3d86 postgres PersistentRecovery_DropType + 0x34c
144457 7    0x6f3e4d postgres PersistentRecovery_Drop + 0x66
144458 8    0x8a856 postgres StartupXLOG_Pass2 + 0x242
144459 9    0x90b5a postgres StartupProcessMain + 0x412
144460 10   0x121e5e postgres AuxiliaryProcessMain + 0x869
144461 11   0x411a3c postgres StartChildProcess + 0x187
144462 12   0x40d3c9 postgres do_reaper + 0x306
144463 13   0x409e26 postgres ServerLoop + 0x3d8
144464 14   0x408c32 postgres PostmasterMain + 0x1407
144465 15   0x32019f postgres main + 0x3ef
144466 16   0x2acd postgres _start + 0xd4
144467 17   0x29f8 postgres start + 0x28

However at the recovery PASS3,  it will redo all relfile and relation dir 
create/drop for on flight transaction:

---------------------------- log of 16385/16387/472145/5 at recovery PASS3 
-----------------------
2015-11-24 12:40:57.991782 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","Scan REDO: Add 
Relation File: '16385/16387/4721455': state 'Create Pending' , transaction 
1449, relation storage manager 'Append-Only', persistent serial number 165517, 
TID (3,320)",,,,,,,0,,"cdbpersistentrecovery.c",968,
2015-11-24 12:40:59.367585 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","Summary REDO: 
Relation File: '16385/16387/4721455': state 'Create Pending' , transaction 
1449, relation storage manager 'Append-Only', persistent serial number 165517, 
TID (3,320)",,,,,,,0,,"cdbpersistentrecovery.c",884,
2015-11-24 12:40:59.878577 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","Needs-Abort REDO: 
(New state 'Aborting Create') Relation File: '16385/16387/4721455': state 
'Create Pending' , transaction 1449, relation storage manager 'Append-Only', 
persistent serial number 165517, TID 
(3,320)",,,,,,,0,,"cdbpersistentrecovery.c",1417,
2015-11-24 12:41:00.450693 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","PersistentRecovery_CrashAbort:
 Set state of Relation File: '16385/16387/4721455': state 'Create Pending' , 
transaction 1449, relation storage manager 'Append-Only', persistent serial 
number 165517, TID (3,320) to 'Aborting 
Create'",,,,,,,0,,"cdbpersistentrecovery.c",1525,
2015-11-24 12:41:01.549163 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","End-Xact-Update-Needed
 REDO: Relation File: '16385/16387/4721455': state 'Aborting Create' , 
transaction 1449, relation storage manager 'Append-Only', persistent serial 
number 165517, TID (3,320)",,,,,,,0,,"cdbpersistentrecovery.c",1609,
2015-11-24 12:41:08.117435 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","Drop REDO: 
Relation File: '16385/16387/4721455': state 'Aborting Create' , transaction 
1449, relation storage manager 'Append-Only', persistent serial number 165517, 
TID (3,320)",,,,,,,0,,"cdbpersistentrecovery.c",1734,
2015-11-24 12:41:08.117472 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","PersistentFileSysObj_DropObject:
 before drop of Relation File: '16385/16387/4721455', persistent serial number 
165517, TID (3,320)",,,,,"Dropping file-system object -- Relation File: 
'16385/16387/4721455'",,0,,"cdbpersistentfilesysobj.c",1965,
2015-11-24 12:41:08.117802 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","PersistentFileSysObj_DropObject:
 after drop of Relation File: '16385/16387/4721455', persistent serial number 
165517, TID (3,320)",,,,,"Dropping file-system object -- Relation File: 
'16385/16387/4721455'",,0,,"cdbpersistentfilesysobj.c",1993,

2015-11-24 12:41:15.042114 
CST,,,p5598,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","PersistentRecovery_ShouldHandlePass3XLogRec:
 non-data change XLOG record",,,,,"REDO PASS 3 @ 1/5E1991DC; LSN 1/5E199614: 
prev 1/5E199174; xid 1449: Master Mirror Log Records - create file: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 
16384",,0,,"cdbpersistentrecovery.c",774,
2015-11-24 12:41:15.042162 
CST,,,p5598,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","mmxlog_redo: 
create file request 16: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 16384, 
16385/16387/472145, path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5""",,,,,"xlog redo create 
file: path ""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 16384
REDO PASS 3 @ 1/5E1991DC; LSN 1/5E199614: prev 1/5E199174; xid 1449: Master 
Mirror Log Records - create file: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 
16384",,0,,"xlog_mm.c",212,
2015-11-24 12:42:36.464366 
CST,,,p5598,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","PersistentRecovery_ShouldHandlePass3XLogRec:
 non-data change XLOG record",,,,,"REDO PASS 3 @ 1/608756E0; LSN 1/60875B18: 
prev 1/60875678; xid 1449: Master Mirror Log Records - remove file: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 
16384",,0,,"cdbpersistentrecovery.c",774,
2015-11-24 12:42:36.464422 
CST,,,p5598,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","mmxlog_redo: 
remove file request 48: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 16384, 
16385/16387/472145, path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5""",,,,,"xlog redo remove 
file: path ""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 16384
REDO PASS 3 @ 1/608756E0; LSN 1/60875B18: prev 1/60875678; xid 1449: Master 
Mirror Log Records - remove file: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 
16384",,0,,"xlog_mm.c",254,


So we need to pass redo all relfile and relation dir create/drop for on flight 
transaction at recovery PASS3.


was (Author: mli):
The hot spot is in the startup 3 phrase. 
Most of the hang problem is occurs when a long load command which running 
against a big partition table were killed, this load command will append some 
the data to some existing files, and also it may create some new files for some 
new virtual segments to execute INSERT command( because all existing files have 
been used for other statements or for other virtual segments). And if this 
INSERT statement  crashed, then it will take a long time to redo the 
create/drop hdfs files.

And I trace more info, in fact all persistent table state and relation dir and 
relfile on hdfs are keep in consistent status at recovery PASS2.

---------------------------- log of 16385/16387/472145/5 at recovery PASS2 
-----------------------
144449 2015-11-25 16:07:35.083979 
CST,,,p32839,th-1553125376,,,,0,,,seg-10000,,,x1013,,"PANIC","XX000","Found 
persistent file-system object in unexpected stat       e 'Create Pending' 
(cdbpersistentfilesysobj.c:1280)",,,,,"State change is for persistent 'Relation 
File: '16385/16387/23281/2'' to go from states 'Drop        Pending' or 
'Aborting Create' to state 'Free' with serial number 19592 at TID (43,70)
144450 Dropping file-system object -- Relation File: 
'16385/16387/23281/2'",,0,,"cdbpersistentfilesysobj.c",1280,"Stack trace:
144451 1    0x565925 postgres errstart + 0x3c4
144452 2    0x6ecc84 postgres PersistentFileSysObj_VerifyExpected + 0x323
144453 3    0x6ed0ac postgres PersistentFileSysObj_StateChange + 0x35a
144454 4    0x6db667 postgres PersistentRelfile_Dropped + 0x44e
144455 5    0x6ee1f6 postgres PersistentFileSysObj_DropObject + 0x56c
144456 6    0x6f3d86 postgres PersistentRecovery_DropType + 0x34c
144457 7    0x6f3e4d postgres PersistentRecovery_Drop + 0x66
144458 8    0x8a856 postgres StartupXLOG_Pass2 + 0x242
144459 9    0x90b5a postgres StartupProcessMain + 0x412
144460 10   0x121e5e postgres AuxiliaryProcessMain + 0x869
144461 11   0x411a3c postgres StartChildProcess + 0x187
144462 12   0x40d3c9 postgres do_reaper + 0x306
144463 13   0x409e26 postgres ServerLoop + 0x3d8
144464 14   0x408c32 postgres PostmasterMain + 0x1407
144465 15   0x32019f postgres main + 0x3ef
144466 16   0x2acd postgres _start + 0xd4
144467 17   0x29f8 postgres start + 0x28

However at the recovery PASS3,  it will redo all relfile and relation dir 
create/drop for on flight transaction:

---------------------------- log of 16385/16387/472145/5 at recovery PASS3 
-----------------------
2015-11-24 12:40:57.991782 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","Scan REDO: Add 
Relation File: '16385/16387/4721455': state 'Create Pending' , transaction 
1449, relation storage manager 'Append-Only', persistent serial number 165517, 
TID (3,320)",,,,,,,0,,"cdbpersistentrecovery.c",968,
2015-11-24 12:40:59.367585 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","Summary REDO: 
Relation File: '16385/16387/4721455': state 'Create Pending' , transaction 
1449, relation storage manager 'Append-Only', persistent serial number 165517, 
TID (3,320)",,,,,,,0,,"cdbpersistentrecovery.c",884,
2015-11-24 12:40:59.878577 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","Needs-Abort REDO: 
(New state 'Aborting Create') Relation File: '16385/16387/4721455': state 
'Create Pending' , transaction 1449, relation storage manager 'Append-Only', 
persistent serial number 165517, TID 
(3,320)",,,,,,,0,,"cdbpersistentrecovery.c",1417,
2015-11-24 12:41:00.450693 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","PersistentRecovery_CrashAbort:
 Set state of Relation File: '16385/16387/4721455': state 'Create Pending' , 
transaction 1449, relation storage manager 'Append-Only', persistent serial 
number 165517, TID (3,320) to 'Aborting 
Create'",,,,,,,0,,"cdbpersistentrecovery.c",1525,
2015-11-24 12:41:01.549163 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","End-Xact-Update-Needed
 REDO: Relation File: '16385/16387/4721455': state 'Aborting Create' , 
transaction 1449, relation storage manager 'Append-Only', persistent serial 
number 165517, TID (3,320)",,,,,,,0,,"cdbpersistentrecovery.c",1609,
2015-11-24 12:41:08.117435 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","Drop REDO: 
Relation File: '16385/16387/4721455': state 'Aborting Create' , transaction 
1449, relation storage manager 'Append-Only', persistent serial number 165517, 
TID (3,320)",,,,,,,0,,"cdbpersistentrecovery.c",1734,
2015-11-24 12:41:08.117472 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","PersistentFileSysObj_DropObject:
 before drop of Relation File: '16385/16387/4721455', persistent serial number 
165517, TID (3,320)",,,,,"Dropping file-system object -- Relation File: 
'16385/16387/4721455'",,0,,"cdbpersistentfilesysobj.c",1965,
2015-11-24 12:41:08.117802 
CST,,,p5526,th-1553125376,,,,0,,,seg-10000,,,x1449,,"LOG","00000","PersistentFileSysObj_DropObject:
 after drop of Relation File: '16385/16387/4721455', persistent serial number 
165517, TID (3,320)",,,,,"Dropping file-system object -- Relation File: 
'16385/16387/4721455'",,0,,"cdbpersistentfilesysobj.c",1993,

2015-11-24 12:41:15.042114 
CST,,,p5598,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","PersistentRecovery_ShouldHandlePass3XLogRec:
 non-data change XLOG record",,,,,"REDO PASS 3 @ 1/5E1991DC; LSN 1/5E199614: 
prev 1/5E199174; xid 1449: Master Mirror Log Records - create file: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 
16384",,0,,"cdbpersistentrecovery.c",774,
2015-11-24 12:41:15.042162 
CST,,,p5598,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","mmxlog_redo: 
create file request 16: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 16384, 
16385/16387/472145, path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5""",,,,,"xlog redo create 
file: path ""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 16384
REDO PASS 3 @ 1/5E1991DC; LSN 1/5E199614: prev 1/5E199174; xid 1449: Master 
Mirror Log Records - create file: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 
16384",,0,,"xlog_mm.c",212,
2015-11-24 12:42:36.464366 
CST,,,p5598,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","PersistentRecovery_ShouldHandlePass3XLogRec:
 non-data change XLOG record",,,,,"REDO PASS 3 @ 1/608756E0; LSN 1/60875B18: 
prev 1/60875678; xid 1449: Master Mirror Log Records - remove file: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 
16384",,0,,"cdbpersistentrecovery.c",774,
2015-11-24 12:42:36.464422 
CST,,,p5598,th-1553125376,,,,0,,,seg-10000,,,,,"LOG","00000","mmxlog_redo: 
remove file request 48: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 16384, 
16385/16387/472145, path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5""",,,,,"xlog redo remove 
file: path ""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 16384
REDO PASS 3 @ 1/608756E0; LSN 1/60875B18: prev 1/60875678; xid 1449: Master 
Mirror Log Records - remove file: path 
""hdfs://localhost:9000/hawq2/16385/16387/472145/5"", filespace 
16384",,0,,"xlog_mm.c",254,


So we need to pass redo all relfile and relation dir create/drop for on flight 
transaction.

> Crash on INSERT into big partition table will cause a long time recovery
> ------------------------------------------------------------------------
>
>                 Key: HAWQ-232
>                 URL: https://issues.apache.org/jira/browse/HAWQ-232
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Transaction
>            Reporter: Ming LI
>            Assignee: Lei Chang
>
> The sample testcase is below:
> DROP TABLE integer_part1;
> CREATE TABLE integer_part1 (a int, b int, c int)
> WITH (appendonly=true, compresslevel=5)
> partition by range (a)
> (
>  partition aa start (0) end (1000) every (1)
> );
> CHECKPOINT;
> insert into integer_part1 select i,i,i from generate_series(0, 999) i;
> -- run shell: killall -9 postgres 
> -- Then restart hawq.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to