[ 
https://issues.apache.org/jira/browse/HAWQ-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217584#comment-15217584
 ] 

Lili Ma commented on HAWQ-592:
------------------------------

The root cause is that the QD keeps an executor cache of all its QEs and tries 
to fetch a QE from that pool for the next query. While fetching, the QD checks 
whether the QE is still alive; if it finds that a QE has become invalid, it 
tries to establish a new connection to the segment specified by 
'task->segment'. For SQL such as "set log_min_messages=debug1", the execution 
logic dispatches the command string to all existing QEs with 
task->segment=NULL, so the QD hits an error when it tries to establish the new 
connection.
In fact, if a QE has failed, the QD does not need to connect to any segment for 
this kind of statement, so we can simply mark the executor as invalid. 
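
To illustrate the idea, here is a minimal sketch of the check described above. 
It is not the actual HAWQ code or patch: the types Segment, Executor and Task, 
the ExecutorState values, and the function bind_cached_executor() are all 
hypothetical stand-ins. It only shows the NULL guard that avoids copying from 
a missing segment, which is where the backtrace in the issue faults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the real HAWQ definitions. */
typedef struct Segment {
    char hostname[64];
    int  port;
} Segment;

typedef enum ExecutorState {
    EXEC_IDLE,            /* cached QE is healthy and reusable */
    EXEC_NEEDS_RECONNECT, /* QE died, a reconnect target has been recorded */
    EXEC_INVALID          /* QE died and there is no segment to reconnect to */
} ExecutorState;

typedef struct Executor {
    bool          alive;  /* result of the aliveness check on the cached QE */
    ExecutorState state;
    Segment       target; /* segment to reconnect to, if any */
} Executor;

typedef struct Task {
    const Segment *segment; /* NULL when a command is sent to all existing QEs */
} Task;

/*
 * Decide what to do with a cached executor for the given task.
 * Returns true if the executor can be used (as is, or after a reconnect),
 * false if it was marked invalid and must be skipped.
 */
bool bind_cached_executor(Executor *exec, const Task *task)
{
    assert(exec != NULL && task != NULL);

    if (exec->alive) {
        exec->state = EXEC_IDLE;
        return true;
    }

    /* Dead QE and no target segment: do not try to reconnect,
     * just mark the executor invalid instead of copying from NULL. */
    if (task->segment == NULL) {
        exec->state = EXEC_INVALID;
        return false;
    }

    /* Dead QE with a known segment: record where to reconnect. */
    memcpy(&exec->target, task->segment, sizeof(Segment));
    exec->state = EXEC_NEEDS_RECONNECT;
    return true;
}
```

In this sketch a dead executor bound to a task with segment == NULL ends up in 
EXEC_INVALID and the function returns false, so the caller can skip it rather 
than attempt a connection.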

> QD fails when connects to QE again in executormgr_allocate_any_executor()
> -------------------------------------------------------------------------
>
>                 Key: HAWQ-592
>                 URL: https://issues.apache.org/jira/browse/HAWQ-592
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Dispatcher
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Lili Ma
>
> We first run a query so that the QD acquires some QEs. Then we kill one QE 
> and run "set log_min_messages=DEBUG1" so that the QD reaches 
> executormgr_allocate_any_executor(). The QD then crashes.
> 1. Run a query to get some QEs.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> -------
>   3725
> (1 row)
> {code}
> {code}
> $ ps -ef|grep postgres
>   501 12817     1   0  4:41下午 ??         0:00.36 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
> --silent-mode=true
>   501 12818 12817   0  4:41下午 ??         0:00.01 postgres: port  5432, master 
> logger process
>   501 12821 12817   0  4:41下午 ??         0:00.00 postgres: port  5432, stats 
> collector process
>   501 12822 12817   0  4:41下午 ??         0:00.03 postgres: port  5432, writer 
> process
>   501 12823 12817   0  4:41下午 ??         0:00.00 postgres: port  5432, 
> checkpoint process
>   501 12824 12817   0  4:41下午 ??         0:00.00 postgres: port  5432, 
> seqserver process
>   501 12825 12817   0  4:41下午 ??         0:00.00 postgres: port  5432, WAL 
> Send Server process
>   501 12826 12817   0  4:41下午 ??         0:00.00 postgres: port  5432, DFS 
> Metadata Cache process
>   501 12827 12817   0  4:41下午 ??         0:00.16 postgres: port  5432, master 
> resource manager
>   501 12844     1   0  4:41下午 ??         0:00.57 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 40000 
> --silent-mode=true
>   501 12845 12844   0  4:41下午 ??         0:00.01 postgres: port 40000, logger 
> process
>   501 12856 12862   0  4:42下午 ??         0:00.05 postgres: port  5432, 
> wangchunling dispatch [local] con13 cmd10 idle [local]
>   501 12872 12844   0  4:42下午 ??         0:00.00 postgres: port 40000, stats 
> collector process
>   501 12873 12844   0  4:42下午 ??         0:00.01 postgres: port 40000, writer 
> process
>   501 12874 12844   0  4:42下午 ??         0:00.00 postgres: port 40000, 
> checkpoint process
>   501 12875 12844   0  4:42下午 ??         0:00.03 postgres: port 40000, 
> segment resource manager
> {code}
> 2. Kill a QE with kill -9 and wait for the segment to come back up.
> {code}
> $ ps -ef|grep postgres
>   501 12817     1   0  4:41下午 ??         0:00.91 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
> --silent-mode=true
>   501 12818 12817   0  4:41下午 ??         0:00.05 postgres: port  5432, master 
> logger process
>   501 12844     1   0  4:41下午 ??         0:01.52 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 40000 
> --silent-mode=true
>   501 12845 12844   0  4:41下午 ??         0:00.04 postgres: port 40000, logger 
> process
>   501 12872 12844   0  4:42下午 ??         0:00.02 postgres: port 40000, stats 
> collector process
>   501 12873 12844   0  4:42下午 ??         0:00.19 postgres: port 40000, writer 
> process
>   501 12874 12844   0  4:42下午 ??         0:00.03 postgres: port 40000, 
> checkpoint process
>   501 12875 12844   0  4:42下午 ??         0:00.41 postgres: port 40000, 
> segment resource manager
>   501 12932 12817   0  4:52下午 ??         0:00.00 postgres: port  5432, stats 
> collector process
>   501 12933 12817   0  4:52下午 ??         0:00.01 postgres: port  5432, writer 
> process
>   501 12934 12817   0  4:52下午 ??         0:00.00 postgres: port  5432, 
> checkpoint process
>   501 12935 12817   0  4:52下午 ??         0:00.00 postgres: port  5432, 
> seqserver process
>   501 12936 12817   0  4:52下午 ??         0:00.00 postgres: port  5432, WAL 
> Send Server process
>   501 12937 12817   0  4:52下午 ??         0:00.00 postgres: port  5432, DFS 
> Metadata Cache process
>   501 12938 12817   0  4:52下午 ??         0:00.04 postgres: port  5432, master 
> resource manager
>   501 12952 12817   0  4:53下午 ??         0:00.00 postgres: port  5432, 
> wangchunling dispatch [local] con30 idle [local]
> {code}
> {code}
> dispatch=# select * from gp_segment_configuration;
>  registration_order | role | status | port  |          hostname           |   
>         address           |            description
> --------------------+------+--------+-------+-----------------------------+-----------------------------+------------------------------------
>                   0 | m    | u      |  5432 | ChunlingdeMacBook-Pro.local | 
> ChunlingdeMacBook-Pro.local |
>                   1 | p    | d      | 40000 | localhost                   | 
> 127.0.0.1                   | resource manager process was reset
> (2 rows)
> dispatch=# select * from gp_segment_configuration;
>  registration_order | role | status | port  |          hostname           |   
>         address           | description
> --------------------+------+--------+-------+-----------------------------+-----------------------------+-------------
>                   0 | m    | u      |  5432 | ChunlingdeMacBook-Pro.local | 
> ChunlingdeMacBook-Pro.local |
>                   1 | p    | u      | 40000 | localhost                   | 
> 127.0.0.1                   |
> (2 rows)
> {code}
> 3. Run "set log_min_messages=DEBUG1" and observe that the QD fails.
> {code}
> dispatch=# set log_min_messages=DEBUG1;
> The connection to the server was lost. Attempting reset: Failed.
> !>
> {code}
> The backtrace when QD fails:
> {code}
> * thread #1: tid = 0x2ff2e7, 0x00007fff87d60380 
> libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem + 64, queue = 
> 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
>   * frame #0: 0x00007fff87d60380 
> libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem + 64
>     frame #1: 0x00007fff8a0a82e2 libsystem_c.dylib`__memcpy_chk + 22
>     frame #2: 0x000000010c299469 postgres`CopySegment(src=0x0000000000000000, 
> cxt=0x00007fa303d07d90) + 137 at cdbutil.c:168
>     frame #3: 0x000000010c2c0df2 
> postgres`executormgr_prepare_connect(segment=0x0000000000000000, 
> is_writer='\x01') + 34 at executormgr.c:983
>     frame #4: 0x000000010c2bec26 
> postgres`dispmgt_build_preconnect_info(segment=0x0000000000000000, 
> is_writer='\x01', executor=0x00007fa3048787d8, data=0x00007fa304877e30, 
> slice=0x00007fa3048781a0, task=0x00007fa3048781c0) + 182 at 
> dispatcher_mgt.c:568
>     frame #5: 0x000000010c2b9ec4 
> postgres`dispatcher_bind_executor(data=0x00007fa304877e30) + 244 at 
> dispatcher.c:956
>     frame #6: 0x000000010c2b9bbb 
> postgres`dispatch_run(data=0x00007fa304877e30) + 219 at dispatcher.c:1237
>     frame #7: 0x000000010c2bb2bf 
> postgres`dispatch_statement(stmt=0x00007fff540b96e8, 
> resource=0x0000000000000000, result=0x0000000000000000) + 271 at 
> dispatcher.c:1491
>     frame #8: 0x000000010c2bb1a3 
> postgres`dispatch_statement_string(string=0x00007fa3068e4e40, 
> serializeQuerytree=0x0000000000000000, serializeLenQuerytree=0, 
> resource=0x0000000000000000, result=0x0000000000000000, 
> sync_on_all_executors='\x01') + 307 at dispatcher.c:1537
>     frame #9: 0x000000010c12aad9 
> postgres`SetPGVariableDispatch(name=0x00007fa30401c120, 
> args=0x00007fa30401c1d8, is_local='\0') + 713 at guc.c:10891
>     frame #10: 0x000000010bfe7e7e 
> postgres`ProcessUtility(parsetree=0x00007fa30401c208, 
> queryString=0x00007fa3068d3e30, params=0x0000000000000000, isTopLevel='\x01', 
> dest=0x00007fa30401c568, completionTag=0x00007fff540b9f90) + 8318 at 
> utility.c:1519
>     frame #11: 0x000000010bfe5810 
> postgres`PortalRunUtility(portal=0x00007fa304821430, 
> utilityStmt=0x00007fa30401c208, isTopLevel='\x01', dest=0x00007fa30401c568, 
> completionTag=0x00007fff540b9f90) + 464 at pquery.c:1896
>     frame #12: 0x000000010bfe3e4b 
> postgres`PortalRunMulti(portal=0x00007fa304821430, isTopLevel='\x01', 
> dest=0x00007fa30401c568, altdest=0x00007fa30401c568, 
> completionTag=0x00007fff540b9f90) + 539 at pquery.c:2006
>     frame #13: 0x000000010bfe33b5 
> postgres`PortalRun(portal=0x00007fa304821430, count=9223372036854775807, 
> isTopLevel='\x01', dest=0x00007fa30401c568, altdest=0x00007fa30401c568, 
> completionTag=0x00007fff540b9f90) + 1269 at pquery.c:1523
>     frame #14: 0x000000010bfd9703 
> postgres`exec_simple_query(query_string=0x00007fa30401b830, 
> seqServerHost=0x0000000000000000, seqServerPort=-1) + 2179 at postgres.c:1745
>     frame #15: 0x000000010bfd7b50 postgres`PostgresMain(argc=4, 
> argv=0x00007fa30680ba10, username=0x00007fa30680b9d0) + 7472 at 
> postgres.c:4754
>     frame #16: 0x000000010bf7bfd6 
> postgres`BackendRun(port=0x00007fa303c18c50) + 1014 at postmaster.c:5889
>     frame #17: 0x000000010bf7b121 
> postgres`BackendStartup(port=0x00007fa303c18c50) + 385 at postmaster.c:5484
>     frame #18: 0x000000010bf77d90 postgres`ServerLoop + 1312 at 
> postmaster.c:2163
>     frame #19: 0x000000010bf763d3 postgres`PostmasterMain(argc=9, 
> argv=0x00007fa303d07a60) + 4931 at postmaster.c:1454
>     frame #20: 0x000000010be80af2 postgres`main(argc=9, 
> argv=0x00007fa303d07a60) + 978 at main.c:226
>     frame #21: 0x00007fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x2ff2e8, 0x00007fff890355be libsystem_kernel.dylib`poll + 
> 10
>     frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x000000010c1e3fed 
> postgres`rxThreadFunc(arg=0x0000000000000000) + 317 at ic_udp.c:6251
>     frame #2: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #3: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #4: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #4: tid = 0x2ff41f, 0x00007fff890343f6 
> libsystem_kernel.dylib`__select + 10
>     frame #0: 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #1: 0x000000010c2c6acb postgres`pg_usleep(microsec=1000000) + 91 at 
> pgsleep.c:43
>     frame #2: 0x000000010c1799ca 
> postgres`generateResourceRefreshHeartBeat(arg=0x00007fa303e03380) + 1482 at 
> rmcomm_QD2RM.c:1546
>     frame #3: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #4: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #5: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
