You might want to try compiling your application with Clang's ThreadSanitizer (TSan). Since libuv is a single-threaded library (per uv_loop_t), any access from multiple threads is generally forbidden.
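For example (a sketch only: the file list is guessed from earlier messages in this thread, and you would link libuv however you normally build it):

    clang -fsanitize=thread -g -O1 -o pexd \
        main.c network_io.c transmit.c rx_fsm.c async.c misc.c \
        -luv -lpthread

TSan only reports races it actually observes at runtime, so run your usual soak test against the instrumented binary.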
On Sun, May 2, 2021 at 2:24 AM Paul Romero <pa...@rcom-software.com> wrote:

> Hi:
>
> A trace giving a more refined picture of the problem, with more details,
> is attached to this note. It dumps what I consider to be the significant
> fields of the uv_loop_t structure (Poll_Loop in my code) and of the
> dynamically allocated uv_poll_t structure, around the essential Libuv API
> routine calls. I think the format of the dump is fairly intuitive, but
> please inform me if you disagree.
>
> My comments about what is happening are identified by the string
> "COMMENT:" and the API calls are identified by the string "MPOLL". The
> trace is redacted to eliminate irrelevant events.
>
> Please let me know if you see something amiss about the data structure
> fields at any point.
>
> My server code is based on a simple event-oriented architecture and uses
> the following 5 concurrent tasks.
>
> 1) main(): Handles incoming connections and Linux signals. (Libuv based.)
> 2) IO_Task(): Handles incoming data from the network. (Libuv based.)
> 3) Protocol_Task(): Does most of the work. (Message based.)
> 4) DB_Task(): Does database insertions. (Message based.)
> 5) Timer_Task(): Handles timer management. (Mostly Libuv based.)
>
> I can send you my server code and a test tool if it would be useful to
> you. However, it may take a few days to put it in an easily usable form.
>
> As previously mentioned, uv_accept() executes in the main() task and the
> rest of the polling related Libuv API routines execute in the IO_Task().
>
> Finally, remember the problem is intermittent. Generally, it takes about
> 45 minutes to 3 hours of constant concurrent connections for the crash to
> occur.
>
> Best Regards,
>
> Paul R.
>
> On 05/01/2021 12:01 AM, Paul Romero wrote:
>
> Hi:
>
> I may have overlooked the following factor.
>
> We know that Libuv is not reentrant and not really multi-task safe. I
> suspect, but don't know, that the internal Libuv data structures attached
> to the uv_loop_t structure (Poll_Loop in my case) are not protected by
> internal Libuv mutexes.
>
> 1) uv_accept() and uv_fileno() execute in the main() thread.
>
> 2) uv_poll_init(), uv_poll_start() and uv_poll_stop() execute in the
> IO_Task() thread.
>
> That means uv_accept() and the Libuv API routines in 2, above, can
> execute concurrently. Is it possible that such concurrency could damage
> the internal Libuv data structures attached to the uv_loop_t structure?
> (Again, Poll_Loop in my case.)
>
> Best Regards,
>
> Paul R.
>
> NOTE: I did a few experiments and noticed that calling uv_close() inside
> a mutex often prevents its callback routine from executing.
>
> On 04/30/2021 04:59 PM, Paul Romero wrote:
>
> Hi:
>
> I think I've isolated the problem and it appears to be in uv_poll_stop().
>
> In particular, if uv_accept() is called immediately after uv_poll_stop(),
> and it reallocates the same socket descriptor just freed by
> uv_poll_stop(), then uv_poll_init() fails with the error EEXIST or EBADF.
> This must have something to do with the uv_loop_t handle passed to
> uv_poll_init(), called Poll_Loop in my code.
>
> I had to instrument my code to keep track of Libuv API calls to see this.
>
> The following explanatory items follow.
>
> A - The sequence of calls that produces the problem.
> B - Commented syslog() diagnostics.
> C - The relevant code segments.
> D - The gdb stack backtrace.
>
> Best Regards,
>
> Paul R.
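> In condensed form, the failing interleaving looks like the following
> sketch (hypothetical names; the real calls, logs and code are in items A
> through D below):
>
>     #include <uv.h>
>
>     extern uv_loop_t Poll_Loop;
>     extern void close_cb(uv_handle_t *handle);
>
>     /* IO_Task() thread: tear down polling for descriptor N. */
>     static void teardown(uv_poll_t *old_handle)
>     {
>         uv_poll_stop(old_handle);                        /* fd == N */
>         uv_close((uv_handle_t *) old_handle, close_cb);  /* completes later */
>     }
>
>     /* The main() thread accepts a new connection in this window; the
>        kernel returns the lowest free descriptor -- N again. Then, back
>        on the IO_Task() thread: */
>     static int restart(uv_poll_t *new_handle, int n)
>     {
>         return uv_poll_init(&Poll_Loop, new_handle, n);
>         /* intermittently fails with UV_EEXIST or UV_EBADF */
>     }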
> A) The sequence of relevant calls is as follows, and the problem only
> occurs with this exact sequence of events.
>
> ****************************************************************************************
>
> 1) uv_poll_stop(): The uv_poll_t handle is bound to socket descriptor N.
>    This should free socket descriptor N.
> 2) uv_accept(): Bound to socket descriptor N as set by uv_fileno().
>    Socket descriptor N should be newly allocated.
> 3) uv_poll_init(): Returns EBADF with socket descriptor N, and
>    Poll_Loop.watchers[N] is NULL. Something is wrong!
>
> B) The commented syslog() diagnostic sequence from the instrumented code.
> **************************************************************************
>
> * Stop polling - Connection 1, Fd 21. This frees the FD.
>
> Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 21] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ
>
> * Stop polling - Connection 0, Fd 20. This frees the FD.
>
> Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 0 fd 20] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ
>
> * New connection - Connection 1, Fd 20. This is fine, since uv_poll_stop()
>   was called for Fd 20, which freed it, and it is newly allocated with
>   uv_accept() and obtained with uv_fileno().
>
> Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 20] - IO_TASK - uv_accept() + uv_fileno()
>
> * However, we crash because uv_poll_init() returned EBADF!
>
> Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 20] - IO_TASK - uv_poll_init() - T_POLL_CONFIG_REQ
> Apr 30 15:46:08 mobius pexd[30963]: IO_TASK: POLL_PROXY - Polling Initialization Error -9, EBADF
>
> C) The edited associated code segments are as follows.
> ********************************************************
>
> 1) Prelude to uv_poll_stop() in poll_proxy() in the IO_Task() thread.
> (network_io.c)
>
>     case T_POLL_DISMANTLE_REQ:
>         ppdata = (uv_handle_t **) &msg->buffer[0];
>         poll_handle = (uv_poll_t *) *ppdata;
>
>         syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ\n",
>                cdesc->index, cdesc->fd);
>
>         if( (r = uv_poll_stop(poll_handle)) )
>         {
>             fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Poll Stop Error %d, %s",
>                         r, uv_err_name(r) );
>         }
>         //
>         // The callback routine notifies the Protocol_Task() when everything is done.
>         //
>         poll_handle->data = (void *) cdesc;
>         uv_close((uv_handle_t *) poll_handle, async_close_callback);
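> An aside on the uv_close() call above: as I understand it (worth
> checking), uv_close() only completes on a later loop iteration, so the
> close callback is the earliest point at which the loop has dropped its
> watcher for the handle's descriptor. A sketch of the ordering I am
> relying on, with a hypothetical callback name:
>
>     #include <stdlib.h>
>     #include <uv.h>
>
>     static void poll_close_cb(uv_handle_t *handle)
>     {
>         /* Loop thread, at least one iteration after uv_close() was
>            called. Only from here on is it safe to close() the fd,
>            notify the other tasks, and let a new accept() recycle the
>            descriptor number. */
>         free(handle);
>     }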
> 2) Connection acceptance in make_incoming_connection() in the main()
> thread. (main.c)
>
>     conn_handle = (uv_tcp_t *) VALLOC(sizeof(uv_tcp_t));
>     uv_tcp_init(&Connect_Loop, conn_handle);
>
>     if((r = uv_accept(listen_handle, (uv_stream_t *) conn_handle)) == 0)
>     {
>         //
>         // A new connection occurred.
>         //
>         uv_os_fd_t fd;
>         ENTER_MUTEX(&Protocol_D_Mutex);
>         nsock = N_Sockets++;
>         EXIT_MUTEX(&Protocol_D_Mutex);
>         //
>         // Fetch the socket descriptor from the connection handle.
>         //
>         uv_fileno((const uv_handle_t *) conn_handle, &fd);
>         //
>         // Allocate the connection descriptor.
>         //
>         cdesc = ALLOC_CONN_DESC(fd);
>         if( !cdesc )
>         {
>             syslog(LOG_DEBUG, "MAIN_TASK: CONNECTION REJECTION - No Connection Descriptors, N = %d", nsock);
>
>             ENTER_MUTEX(&Protocol_D_Mutex);
>             N_Sockets--;
>             close(fd);
>             EXIT_MUTEX(&Protocol_D_Mutex);
>             uv_close((uv_handle_t *) conn_handle, forced_close_callback);
>             return;
>         }
>
>         syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_accept() + uv_fileno()\n",
>                cdesc->index, cdesc->fd);
>         //
>         // Save the connection handle and start polling.
>         //
>         cdesc->conn_handle = (uv_tcp_t *) conn_handle;
>
>         syslog(LOG_INFO, "MAIN_TASK: NEW CONNECTION ESTABLISHED - CONN %d FD %d",
>                cdesc->index, cdesc->fd);
>         //
>         // Set up epoll() plumbing by sending a message to IO_Task().
>         //
>         bzero((void *) &poll_data, sizeof(ASYNC_DATA));
>         poll_data.type = T_POLL_CONFIG_REQ;
>         poll_data.async_handle = &IO_Task_Async_Handle;
>         poll_data.async_q = &IO_Task_Input_Q;
>         poll_data.object_handle = NULL;
>         //
>         // The T_POLL_CONFIG_RSP message will be sent to the Protocol_Task(),
>         // which is in S_IDLE state.
>         //
>         send_async_msg(cdesc, &poll_data);
>     }
>
> 3) Prelude to uv_poll_init() in poll_proxy() in the IO_Task() thread.
> (network_io.c)
>
>     case T_POLL_CONFIG_REQ:
>         //
>         // Main process request to start Libuv/epoll() operation.
>         //
>         poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));
>
>         syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_poll_init() - T_POLL_CONFIG_REQ\n",
>                cdesc->index, cdesc->fd);
>         if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
>         {
>             fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s",
>                         r, uv_err_name(r));
>         }
>
> D) Stack backtrace.
>
>     (gdb) bt
>     #0  0x00007f62c47b0267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
>     #1  0x00007f62c47b1eca in __GI_abort () at abort.c:89
>     #2  0x000000000040d411 in fatal_error (log_prio=2,
>         fmt=0x424e28 "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s") at misc.c:135
>     #3  0x0000000000405b6d in poll_proxy (handle=0x641f80 <IO_Task_Async_Handle>) at network_io.c:291
>     #4  0x00000000004114cb in uv__async_io (loop=0xc26800 <Poll_Loop>, w=<optimized out>,
>         events=<optimized out>) at src/unix/async.c:163
>     #5  0x000000000041d255 in uv__io_poll (loop=loop@entry=0xc26800 <Poll_Loop>,
>         timeout=<optimized out>) at src/unix/linux-core.c:462
>     #6  0x0000000000411c48 in uv_run (loop=0xc26800 <Poll_Loop>, mode=UV_RUN_DEFAULT)
>         at src/unix/core.c:385
>     #7  0x0000000000405835 in IO_Task (arg=0x0) at network_io.c:100
>     #8  0x00007f62c4b4c6aa in start_thread (arg=0x7f62bbfff700) at pthread_create.c:333
>     #9  0x00007f62c4881eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>
> At this point cdesc->fd is 20 and Poll_Loop.watchers[20] is NULL.
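> (An aside on frames #3-#4 above: poll_proxy() is driven out of
> uv__async_io(), the loop's cross-thread wakeup path. As I understand it,
> uv_async_send() is the only Libuv call that is safe from a foreign
> thread; everything else must run on the loop's own thread. A minimal
> sketch of that pattern, with hypothetical names rather than my real
> ones:)
>
>     #include <uv.h>
>
>     static uv_async_t io_async;
>
>     static void on_io_async(uv_async_t *handle)
>     {
>         /* Loop thread: drain the input queue here and make every
>            uv_poll_* call from this context, and only from this context. */
>     }
>
>     static void io_task_setup(uv_loop_t *loop)    /* loop thread, once */
>     {
>         uv_async_init(loop, &io_async, on_io_async);
>     }
>
>     static void wake_io_task(void)                /* safe from any thread */
>     {
>         uv_async_send(&io_async);
>     }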
> On 04/30/2021 11:48 AM, Paul Romero wrote:
>
> Hi:
>
> My guess is that there is a subtle, probably timing related, problem in
> uv_poll_stop(), simply because I can't think of any other relevant Libuv
> routines that are explicitly called after uv_poll_start() succeeds. What
> do you think? Of course, there could be a problem in the way connections
> are rejected. All that code is in the make_incoming_connection() callback
> routine in main.c.
>
> Best Regards,
>
> Paul R.
>
> On 04/29/2021 03:14 PM, Paul Romero wrote:
>
> Hi:
>
> I reattached the following files with some confusing debugging material
> removed, and a dump using them follows.
>
> * main.c
> * network_io.c
> * rx_fsm.c
> * async.c
>
> Again, this is the relevant code segment.
>
>     case T_POLL_CONFIG_REQ:
>         //
>         // Main process request to start Libuv/epoll() operation.
>         //
>         poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));
>
>         if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
>         {
>             fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s",
>                         r, uv_err_name(r));
>         }
>
> This is the crash diagnostic.
>
>     IO_TASK: POLL_PROXY - Polling Initialization Error -17, EEXIST
>     Aborted (core dumped)
>
> And this is the gdb backtrace, which is typical of the problem.
>
>     Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>     Core was generated by `./pexd'.
>     Program terminated with signal SIGABRT, Aborted.
>     #0  0x00007ff1da90b267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
>     55  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>     (gdb) bt
>     #0  0x00007ff1da90b267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
>     #1  0x00007ff1da90ceca in __GI_abort () at abort.c:89
>     #2  0x000000000040d379 in fatal_error (log_prio=2,
>         fmt=0x424ce0 "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s") at misc.c:135
>     #3  0x0000000000405b21 in poll_proxy (handle=0x641f80 <IO_Task_Async_Handle>) at network_io.c:287
>     #4  0x000000000041142b in uv__async_io (loop=0xc26800 <Poll_Loop>, w=<optimized out>,
>         events=<optimized out>) at src/unix/async.c:163
>     #5  0x000000000041d1b5 in uv__io_poll (loop=loop@entry=0xc26800 <Poll_Loop>,
>         timeout=<optimized out>) at src/unix/linux-core.c:462
>     #6  0x0000000000411ba8 in uv_run (loop=0xc26800 <Poll_Loop>, mode=UV_RUN_DEFAULT)
>         at src/unix/core.c:385
>     #7  0x000000000040580f in IO_Task (arg=0x0) at network_io.c:100
>     #8  0x00007ff1daca76aa in start_thread (arg=0x7ff1d5a7c700) at pthread_create.c:333
>     #9  0x00007ff1da9dceed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>
> At this point cdesc->fd is 22 and Poll_Loop.watchers[22] = 0x17ef458 -- a
> valid address?
>
> On 04/29/2021 09:09 AM, Paul Romero wrote:
>
> Hi:
>
> The most common symptom of the problem is that uv_poll_init() fails,
> which means the problem must be related to the uv_loop_t Poll_Loop. The
> relevant code segment is as follows.
>
>     poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));
>
>     if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
>     {
>         fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s",
>                     r, uv_err_name(r));
>     }
>
> The following files illustrate the problem and are attached to this
> message. Please advise me how to attach files to the list directly.
>
> * trace_badf.txt: A stack trace when the error code is EBADF.
> * trace_eexist.txt: A stack trace when the error code is EEXIST.
> * network_io.c: The IO_Task() thread code. The problem occurs in the poll_proxy() routine.
> * main.c: The main() process/thread code where most initialization occurs.
> * transmit.c: The Transmit_Task() thread.
> * rx_fsm.c: The heart of the protocol task code.
>
> You are right that only the IO_Task() performs polling related
> operations. However, please note that the Transmit_Task() thread uses the
> underlying TCP socket descriptor to send data to the network with Linux
> send(). (See the TX_DATA() routine.) Also, USE_SSL and USE_SCRAM are not
> defined.
>
> I will send more information if I am able to reproduce the problem at
> another point.
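> As an aside on TX_DATA() above: send() on a socket can accept fewer
> bytes than requested, so a full-write loop like the following sketch
> (hypothetical, not my exact code) is the behavior I intend:
>
>     #include <errno.h>
>     #include <sys/socket.h>
>     #include <sys/types.h>
>
>     static ssize_t send_all(int fd, const char *buf, size_t len)
>     {
>         size_t off = 0;
>         while (off < len) {
>             ssize_t n = send(fd, buf + off, len - off, MSG_NOSIGNAL);
>             if (n < 0) {
>                 if (errno == EINTR)
>                     continue;          /* interrupted: try again      */
>                 return -1;             /* EAGAIN etc.: caller retries */
>             }
>             off += (size_t) n;
>         }
>         return (ssize_t) off;
>     }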
> Also, I have seen the issue you mentioned before and suspect it may be
> relevant. Please let me know if you need anything else.
>
> Best Regards,
>
> Paul R.
>
> On 04/28/2021 08:52 PM, Jameson Nash wrote:
>
> I'm assuming from your description that you've ensured all libuv calls
> happen from exactly one thread (except uv_async_send). It sounds like
> you're describing the issue that
> https://github.com/libuv/libuv/commit/c9406ba0e3d67907c1973a71968b89a6bd83c63c
> was intended to fix, which was included in v1.41.0.
>
> Note that Poll_Loop.watchers[N] is not always required to be NULL, so
> perhaps you've found a case where it is still incorrectly expecting that?
> You might want to file an issue with code to reproduce the problem, or at
> least an `rr` trace dump of the code that hits the problem (note that
> `rr` could potentially upload all of your hard drive, so be careful what
> you are submitting).
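> For example (a sketch; rr lives at https://rr-project.org, and `rr pack`
> bundles a recording for sharing):
>
>     rr record ./pexd     # run the server under rr until it aborts
>     rr replay            # replay the exact same execution inside gdb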
> On Wed, Apr 28, 2021 at 11:13 PM pa...@rcom-software.com
> <pa...@rcom-software.com> wrote:
>
>> Hi Folks:
>>
>> I am experiencing an intermittent problem with uv_poll_init() after the
>> successful establishment and release of multiple concurrent TCP
>> connections. I am not sure if this problem is due to a bug, which may be
>> corrected in another Libuv release, or if I am doing something wrong
>> when releasing the poll handle. Do you have insight about the cause?
>> The details I consider most important follow.
>>
>> The uv_poll_t callback function reads incoming TCP data as follows.
>>
>>     if(events & UV_READABLE)
>>     {
>>         CONN_DESC *cdesc = (CONN_DESC *) poll_handle->data;
>>         n = recv(cdesc->fd, cdesc->rx_buf, RX_BUF_SIZE, MSG_DONTWAIT);
>>
>> NOTE: I am using Libuv version 1.41.0 running on Ubuntu Linux version
>> 15.04.
>>
>> The problem is that uv_poll_init() fails, normally with the EEXIST or
>> EBADF error code, and my investigation indicates the uv_loop_t
>> Poll_Loop.watchers[N] field is not NULL when it should be, where N is
>> the TCP socket descriptor. It occurs immediately after the uv_poll_t
>> poll_handle is allocated. (There is exactly one TCP socket descriptor
>> per poll handle.) The call to uv_poll_init() is as follows, and the
>> socket descriptor is obtained with uv_fileno().
>>
>>     if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
>>     {
>>         fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Polling Initialization Error %d, %s",
>>                     r, uv_err_name(r));
>>     }
>>
>> It occurs in the IO_Task() thread when multiple TCP socket descriptors
>> are in use. The IO_Task() releases the poll_handle with the following
>> code sequence when it is notified, via a Libuv async message from the
>> Protocol_Task() thread, that polling should stop.
>>
>>     if( (r = uv_poll_stop(poll_handle)) )
>>     {
>>         fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Poll Stop Error %d, %s",
>>                     r, uv_err_name(r) );
>>     }
>>
>>     poll_handle->data = (void *) cdesc;
>>     uv_close((uv_handle_t *) poll_handle, async_close_callback);
>>
>> The actual release occurs in async_close_callback() as follows, and the
>> Protocol_Task() releases the TCP socket descriptor with a normal Linux
>> close() after it receives the T_DISMANTLE_RSP response message.
>>
>>     VFREE((UCHAR *) handle);
>>     //
>>     // Send a notification message to the Protocol_Task().
>>     //
>>     msg = MSG_ALLOC(0, FALSE);
>>     msg->class = C_NOTIFY;
>>     msg->type = T_DISMANTLE_RSP;
>>     msg->info = 0;
>>
>>     SendProtoMsg(cdesc, msg);
>>
>> I think the underlying cause is that if a new TCP connection uses the
>> same TCP socket descriptor as one released with uv_poll_stop() and
>> uv_close(), the call to uv_poll_init() occurs before the socket closure
>> has propagated into the uv_loop_t Poll_Loop.
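>> If that theory is right, a guard along the following lines (a sketch;
>> the bookkeeping is hypothetical and not in my real code) should make
>> the window visible by refusing to re-init a descriptor whose previous
>> handle is still draining through uv_close():
>>
>>     #include <stdint.h>
>>     #include <stdlib.h>
>>     #include <uv.h>
>>
>>     #define MAX_FDS 1024
>>
>>     /* Old poll handle still closing, indexed by fd. */
>>     static uv_poll_t *closing_by_fd[MAX_FDS];
>>
>>     static void gated_close_cb(uv_handle_t *handle)
>>     {
>>         int fd = (int) (intptr_t) handle->data;  /* stashed before uv_close() */
>>         closing_by_fd[fd] = NULL;                /* fd may be reused now      */
>>         free(handle);
>>     }
>>
>>     static int gated_poll_init(uv_loop_t *loop, uv_poll_t *h, int fd)
>>     {
>>         if (fd >= 0 && fd < MAX_FDS && closing_by_fd[fd] != NULL)
>>             return UV_EBUSY;   /* previous watcher for this fd not gone yet */
>>         return uv_poll_init(loop, h, fd);
>>     }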
>> Best Regards,
>>
>> Paul Romero
>> -----------
>> RCOM Communications Software
>> EMAIL: pa...@rcom-software.com
>> PHONE: (510)482-2769