[ https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220943#comment-15220943 ]
ASF GitHub Bot commented on HAWQ-564:
-------------------------------------

GitHub user jiny2 opened a pull request:

    https://github.com/apache/incubator-hawq/pull/543

    HAWQ-564. QD hangs when connecting to resource manager

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiny2/incubator-hawq HAWQ0564

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hawq/pull/543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #543

----
commit 449951ad0e233f436388f11ffd06107343ce538a
Author: YI JIN <y...@pivotal.io>
Date:   2016-04-01T00:51:44Z

    HAWQ-564. QD hangs when connecting to resource manager

----


> QD hangs when connecting to resource manager
> --------------------------------------------
>
>                 Key: HAWQ-564
>                 URL: https://issues.apache.org/jira/browse/HAWQ-564
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Resource Manager
>   Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Yi Jin
>
> When we first inject a panic into the QE process and run a query, the
> segment goes down. After the segment comes back up, we run another query
> and get the correct answer. Then we inject the same panic a second time.
> After the segment goes down and comes back up again, we run a query and
> find that the QD process hangs when connecting to the resource manager.
> Here is the backtrace when QD hangs:
> {code}
> * thread #1: tid = 0x21d8be, 0x00007fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x0000000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156
>     frame #2: 0x0000000101db85f5 postgres`callSyncRPCRemote(hostname=0x00007f9c19e00cd0, port=5437, sendbuff=0x00007f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=<unavailable>, errorbuf=0x000000010230c1a0, errorbufsize=<unavailable>) + 645 at rmcomm_SyncComm.c:122
>     frame #3: 0x0000000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x00007f9c1b918f50, sendbuffsize=<unavailable>, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x00007f9c1b918e70, errorbuf=<unavailable>, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
>     frame #4: 0x0000000101db2d3c postgres`acquireResourceFromRM(index=<unavailable>, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x00007f9c1a02d398, preferred_nodes_size=<unavailable>, max_seg_count_fix=<unavailable>, min_seg_count_fix=<unavailable>, errorbuf=<unavailable>, errorbufsize=<unavailable>) + 572 at rmcomm_QD2RM.c:742
>     frame #5: 0x0000000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x00007f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796
>     frame #6: 0x0000000101e8c60f postgres`calculate_planner_segment_num(query=<unavailable>, resourceLife=QRL_ONCE, fullRangeTable=<unavailable>, intoPolicy=<unavailable>, sliceNum=5) + 14287 at cdbdatalocality.c:4207
>     frame #7: 0x0000000101c0f671 postgres`planner + 106 at planner.c:496
>     frame #8: 0x0000000101c0f607 postgres`planner(parse=0x00007f9c1a02a140, cursorOptions=<unavailable>, boundParams=0x0000000000000000, resourceLife=QRL_ONCE) + 311 at planner.c:310
>     frame #9: 0x0000000101c8eb33 postgres`pg_plan_query(querytree=0x00007f9c1a02a140, boundParams=0x0000000000000000, resource_life=QRL_ONCE) + 99 at postgres.c:837
>     frame #10: 0x0000000101c956ae postgres`exec_simple_query + 21 at postgres.c:911
>     frame #11: 0x0000000101c95699 postgres`exec_simple_query(query_string=0x00007f9c1a028a30, seqServerHost=0x0000000000000000, seqServerPort=-1) + 1577 at postgres.c:1671
>     frame #12: 0x0000000101c91a4c postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, username=0x00007f9c1b808cf0) + 9404 at postgres.c:4754
>     frame #13: 0x0000000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889
>     frame #14: 0x0000000101c4ad99 postgres`ServerLoop at postmaster.c:5484
>     frame #15: 0x0000000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163
>     frame #16: 0x0000000101c47d3b postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 5019 at postmaster.c:1454
>     frame #17: 0x0000000101bb1aa9 postgres`main(argc=9, argv=0x00007f9c19c1eef0) + 1433 at main.c:209
>     frame #18: 0x00007fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x21d8bf, 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x0000000101dfe723 postgres`rxThreadFunc(arg=<unavailable>) + 2163 at ic_udp.c:6251
>     frame #2: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #3: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #4: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #3: tid = 0x21d9c2, 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #0: 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #1: 0x0000000101e9d42e postgres`pg_usleep(microsec=<unavailable>) + 78 at pgsleep.c:43
>     frame #2: 0x0000000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x00007f9c19f02480) + 166 at rmcomm_QD2RM.c:1519
>     frame #3: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #4: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #5: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
> {code}
> And here are the operations:
> 1. Before injection, get query answer correctly.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> -------
>   3725
> (1 row)
> {code}
> 2. Inject panic, fault triggered, and segment is down.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR: fault triggered, fault name:'fail_qe_when_do_query' fault type:'panic' (faultinjector.c:656) (seg0 localhost:40000 pid=26936)
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR: failed to acquire resource from resource manager, 1 of 1 segments is unavailable (pquery.c:807)
> {code}
> 3. After a while and when segment is up, get correct answer.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> -------
>   3725
> (1 row)
> {code}
> 4. Inject again, fault triggered, and segment is down.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR: fault triggered, fault name:'fail_qe_when_do_query' fault type:'panic' (faultinjector.c:656) (seg0 localhost:40000 pid=26994)
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR: failed to acquire resource from resource manager, 1 of 1 segments is unavailable (pquery.c:807)
> {code}
> 5. After a while, run query and find QD hangs.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> {code}
> 6. Open another terminal, find segment is already up.
> {code}
> dispatch=# select * from gp_segment_configuration;
>  registration_order | role | status | port  |          hostname           |           address           | description
> --------------------+------+--------+-------+-----------------------------+-----------------------------+-------------
>                   0 | m    | u      |  5432 | ChunlingdeMacBook-Pro.local | ChunlingdeMacBook-Pro.local |
>                   1 | p    | u      | 40000 | localhost                   | 127.0.0.1                   |
> (2 rows)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
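
For readers following the backtrace: thread #1 is blocked in poll() underneath callSyncRPCRemote, which is the shape a synchronous request takes when it waits for a reply that never arrives. The sketch below is a generic illustration of bounding such a wait with a deadline. It is not HAWQ code and not the change in pull request #543; the helper names now_ms and wait_readable_with_deadline are hypothetical, and the only real APIs used are POSIX poll() and clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: current time in milliseconds on a monotonic clock. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper (not part of HAWQ): wait until fd has something to
 * read or until timeout_ms has elapsed, whichever comes first.
 *
 * Returns 1 if the caller should read fd (data, EOF, or socket error),
 *         0 if the deadline passed with no activity,
 *        -1 on a poll() failure or an invalid descriptor.
 */
static int wait_readable_with_deadline(int fd, int timeout_ms)
{
    long long deadline = now_ms() + timeout_ms;

    for (;;)
    {
        long long remaining = deadline - now_ms();

        if (remaining <= 0)
            return 0;           /* deadline reached: give up, do not hang */

        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        int rc = poll(&pfd, 1, (int) remaining);

        if (rc > 0)
            return (pfd.revents & POLLNVAL) ? -1 : 1;
        if (rc == 0)
            return 0;           /* poll() itself timed out */
        if (errno != EINTR)
            return -1;          /* real error */
        /* Interrupted by a signal: retry with the time that is left. */
    }
}
```

A caller would treat a return value of 0 as "peer did not answer in time" and report an error to the client instead of waiting indefinitely. Recomputing the remaining time after EINTR keeps repeated signals from extending the total wait.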