Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric, The timestamp changed, but the content looks the same. -- Don Poitras - Host RSAS Institute Inc.SAS Campus Drive mailto:sas...@sas.com (919)531-5637 Fax:677- Cary, NC 27513
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
On Wed, Aug 14, 2019 at 9:52 AM Don Poitras wrote: > > Eric, > http://people.apache.org/~covener/trunk-proxy-segv.diff looks the same. Did > you place somewhere else? Sorry corrected now. (Adjusting to scp being disabled and dropped it in my actual home directory by accident.)
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric, http://people.apache.org/~covener/trunk-proxy-segv.diff looks the same. Did you place somewhere else? -- Don Poitras - Host RSAS Institute Inc.SAS Campus Drive mailto:sas...@sas.com (919)531-5637 Fax:677- Cary, NC 27513
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
On Tue, Aug 13, 2019 at 3:12 PM Don Poitras wrote: > You also moved the worker mutex unlock so it only happens if worker->s->hmax > != NULL Sorry, that was a bad edit, I refreshed the patch.
RE: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric, It's in the works. This involves several other people as the large server is owned by another group and getting our backend software ready to be called from httpd is somewhat involved. The build needs to be done by someone else as well. I've requested they build with --enable-pool-concurrency-check as that's what allows me to see which two threads are stepping on each other. I looked at your change and I see that you changed the logic so that the worker mutex is set for the first call to ap_proxy_initialize_worker(). I don't think that hurts anything and probably makes it easier to read, but I don't think it has an effect as in the first call (the one that allocates the pool), there is no way for any other thread to be using the pool. You also moved the worker mutex unlock so it only happens if worker->s->hmax != NULL and leave the two calls to connection_constructor() un-serialized. I don't have in my notes definite proof that this function must be serialized, so hopefully the test will show if this is ok. However, even in the case where worker->s->hmax == NULL, we need to unlock the worker mutex before returning. If this shows as a problem in the test, I can make that small change and run again and we should also put back the unlock call in the error condition for init_conn_pool(), as we shouldn't leave the mutex locked if we error out of this routine. Also, when this is finalized, all the new calls to APLOGNO need to go through your utility to assign message numbers. > -Original Message- > From: Eric Covener > Sent: Monday, August 12, 2019 3:25 PM > To: Apache HTTP Server Development List > Subject: Re: [PATCH 63503] - Reverse proxy server - SIGSEGV > > Hi Don, can you try this very similar patch? I applied yours to my > sandbox to get more context and made a few minor changes (including > pre-existing stuff that looked misleading) > > http://people.apache.org/~covener/trunk-proxy-segv.diff -- Don Poitras - Host RSAS Institute Inc.SAS Campus Drive mailto:sas...@sas.com (919)531-5637 Fax:677- Cary, NC 27513
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
On Mon, Aug 12, 2019 at 4:33 PM Don Poitras wrote: > > Eric, > I'm not sure what you're asking. The global mutex will cover the call to > init_conn_pool(), so there isn't a problem with worker->cp getting overlayed > if that's what you're thinking. I am thinking the opposite. Two threads interleaved in ap_proxy_initialize_worker could both think they're initializing the worker. When the 2nd one gets blocked halfway through in the global mutex, the first will have created worker->cp->pool, left, and begin using it (for the DNS calls). > The problem is that ap_proxy_initialize_worker() gets called repeatedly with > the same worker pointer. The first time it's called for a particular worker, > it creates a connection pool and the worker thread mutex. > But _every_ use of worker->cp->pool must be serialized. Yes, minus the direct constructor call that only creates a subpool. > Subsequent calls to ap_proxy_initialize_worker() with the same worker pointer > will use the pool and mutex created in the first call. It's only these calls > after the first that are at risk for concurrent use of the pool (hence my use > of the 'notfirst' flag). In the first call, I don't even bother getting the > proxy mutex as there isn't any way for other threads to be using the pool > before it's created. I think the first thread has to grab the thread mutex it just created if the global lock is not held earlier than it is today (to make checking the initialization state safer). Another route to safety is to create the mutex when the worker is defined since this is at startup rather than this lazy load but it is probably a riskier change.
RE: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric, I'm not sure what you're asking. The global mutex will cover the call to init_conn_pool(), so there isn't a problem with worker->cp getting overlayed if that's what you're thinking. The problem is that ap_proxy_initialize_worker() gets called repeatedly with the same worker pointer. The first time it's called for a particular worker, it creates a connection pool and the worker thread mutex. This mutex appears to only be used in ap_proxy_determine_connection() to protect setting of worker->cp->addr by apr_sockaddr_info_get() (it also protects use of worker->cp->pool, but it's not clear that was recognized as a requirement). But _every_ use of worker->cp->pool must be serialized. Subsequent calls to ap_proxy_initialize_worker() with the same worker pointer will use the pool and mutex created in the first call. It's only these calls after the first that are at risk for concurrent use of the pool (hence my use of the 'notfirst' flag). In the first call, I don't even bother getting the proxy mutex as there isn't any way for other threads to be using the pool before it's created. -- Don Poitras - Host RSAS Institute Inc.SAS Campus Drive mailto:sas...@sas.com (919)531-5637 Fax:677- Cary, NC 27513
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
Via inspection this looks quite sane. > On Aug 12, 2019, at 3:24 PM, Eric Covener wrote: > > Hi Don, can you try this very similar patch? I applied yours to my > sandbox to get more context and made a few minor changes (including > pre-existing stuff that looked misleading) > > http://people.apache.org/~covener/trunk-proxy-segv.diff > >
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
Hi Don, can you try this very similar patch? I applied yours to my sandbox to get more context and made a few minor changes (including pre-existing stuff that looked misleading) http://people.apache.org/~covener/trunk-proxy-segv.diff On Mon, Aug 12, 2019 at 2:43 PM Eric Covener wrote: > > On Mon, Aug 12, 2019 at 2:37 PM Eric Covener wrote: > > > > On Mon, Aug 12, 2019 at 12:32 PM Don Poitras wrote: > > > > > > Eric, > > > The global mutex only serializes concurrent calls to > > > ap_proxy_initialize_worker(). The worker pool is also used when the > > > proxy_handler() is called from a thread kicked off from a _previous_ call > > > to ap_proxy_initialize_worker() . Turning on the pool concurrency check > > > shows this happening: > > > > > > pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur > > > 7f25a21fc700 in use by 7f2598bed700, state in use -> in use > > > > > > 172256 . Thawed libc:gsignal (+0x37) > > ><--- abort > > > 1 libc:gsignal (+0x37) > > > 2 libc:abort (+0x143) > > > 3 libapr-1:\apr_pools\pool_concurrency_abort 768 > > > 4 libapr-1:\apr_pools\apr_palloc 778 (+0xA) > > > 5 libapr-1:\thread_cond\apr_thread_cond_create 44 > > > 6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9) > > > 7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F) > > > 8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE) > > > 9 httpd:UNKNOWN at 0x004543A0 > > > > > > Here's the thead the diagnostic said had the pool 'in use' when > > > apr_palloc() was called: > > > > > > 172271Thawed libc:__GI_strncmp (+0x1680) > > > <-- in use > > > 1 libc:__GI_strncmp (+0x1680) > > > 2 libc:getenv (+0xBD) > > > 3 libc:__nscd_getai (+0x3D3) > > > 4 libc:gaih_inet.constprop.8 (+0x15F2) > > > 5 libc:getaddrinfo (+0x11F) > > > 6 libapr-1:\sockaddr\call_resolver 397 (+0x10) > > > 7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15) > > > 8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D) > > > 9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18) > > > 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16) > > > 11 httpd:UNKNOWN at 0x004543A0 > > > > Thanks, I feel like I understand a bit more now at least about the > > apr_reslist_acquire() vs. DNS lookups cases. > > > > I guess in the above case ap_proxy_initialize_worker() has some funny > > business where the state is checked outside of the global mutex and if > > a 2nd thread sneaks in they will reuse worker->cp->pool > > Whoops, gmail ate a bunch of keystrokes and sent prematurely when I > tried to paste.: > > if (worker->cp == NULL) > init_conn_pool(p, worker); > > But then will create a new reslist with the old pool that is already > being used in thread 1 that has gotten farther into the handler. > I don't totally follow how the patch in BZ is fixing this one if > that's what's happening. > > -- > Eric Covener > cove...@gmail.com -- Eric Covener cove...@gmail.com
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
On Mon, Aug 12, 2019 at 2:37 PM Eric Covener wrote: > > On Mon, Aug 12, 2019 at 12:32 PM Don Poitras wrote: > > > > Eric, > > The global mutex only serializes concurrent calls to > > ap_proxy_initialize_worker(). The worker pool is also used when the > > proxy_handler() is called from a thread kicked off from a _previous_ call > > to ap_proxy_initialize_worker() . Turning on the pool concurrency check > > shows this happening: > > > > pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur > > 7f25a21fc700 in use by 7f2598bed700, state in use -> in use > > > > 172256 . Thawed libc:gsignal (+0x37) > > <--- abort > > 1 libc:gsignal (+0x37) > > 2 libc:abort (+0x143) > > 3 libapr-1:\apr_pools\pool_concurrency_abort 768 > > 4 libapr-1:\apr_pools\apr_palloc 778 (+0xA) > > 5 libapr-1:\thread_cond\apr_thread_cond_create 44 > > 6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9) > > 7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F) > > 8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE) > > 9 httpd:UNKNOWN at 0x004543A0 > > > > Here's the thead the diagnostic said had the pool 'in use' when > > apr_palloc() was called: > > > > 172271Thawed libc:__GI_strncmp (+0x1680) > > <-- in use > > 1 libc:__GI_strncmp (+0x1680) > > 2 libc:getenv (+0xBD) > > 3 libc:__nscd_getai (+0x3D3) > > 4 libc:gaih_inet.constprop.8 (+0x15F2) > > 5 libc:getaddrinfo (+0x11F) > > 6 libapr-1:\sockaddr\call_resolver 397 (+0x10) > > 7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15) > > 8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D) > > 9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18) > > 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16) > > 11 httpd:UNKNOWN at 0x004543A0 > > Thanks, I feel like I understand a bit more now at least about the > apr_reslist_acquire() vs. DNS lookups cases. > > I guess in the above case ap_proxy_initialize_worker() has some funny > business where the state is checked outside of the global mutex and if > a 2nd thread sneaks in they will reuse worker->cp->pool Whoops, gmail ate a bunch of keystrokes and sent prematurely when I tried to paste.: if (worker->cp == NULL) init_conn_pool(p, worker); But then will create a new reslist with the old pool that is already being used in thread 1 that has gotten farther into the handler. I don't totally follow how the patch in BZ is fixing this one if that's what's happening. -- Eric Covener cove...@gmail.com
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
On Mon, Aug 12, 2019 at 12:32 PM Don Poitras wrote: > > Eric, > The global mutex only serializes concurrent calls to > ap_proxy_initialize_worker(). The worker pool is also used when the > proxy_handler() is called from a thread kicked off from a _previous_ call to > ap_proxy_initialize_worker() . Turning on the pool concurrency check shows > this happening: > > pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur > 7f25a21fc700 in use by 7f2598bed700, state in use -> in use > > 172256 . Thawed libc:gsignal (+0x37) ><--- abort > 1 libc:gsignal (+0x37) > 2 libc:abort (+0x143) > 3 libapr-1:\apr_pools\pool_concurrency_abort 768 > 4 libapr-1:\apr_pools\apr_palloc 778 (+0xA) > 5 libapr-1:\thread_cond\apr_thread_cond_create 44 > 6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9) > 7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F) > 8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE) > 9 httpd:UNKNOWN at 0x004543A0 > > Here's the thead the diagnostic said had the pool 'in use' when apr_palloc() > was called: > > 172271Thawed libc:__GI_strncmp (+0x1680) > <-- in use > 1 libc:__GI_strncmp (+0x1680) > 2 libc:getenv (+0xBD) > 3 libc:__nscd_getai (+0x3D3) > 4 libc:gaih_inet.constprop.8 (+0x15F2) > 5 libc:getaddrinfo (+0x11F) > 6 libapr-1:\sockaddr\call_resolver 397 (+0x10) > 7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15) > 8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D) > 9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18) > 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16) > 11 httpd:UNKNOWN at 0x004543A0 Thanks, I feel like I understand a bit more now at least about the apr_reslist_acquire() vs. DNS lookups cases. I guess in the above case ap_proxy_initialize_worker() has some funny business where the state is checked outside of the global mutex and if a 2nd thread sneaks in they will reuse worker->cp->pool
RE: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric, The global mutex only serializes concurrent calls to ap_proxy_initialize_worker(). The worker pool is also used when the proxy_handler() is called from a thread kicked off from a _previous_ call to ap_proxy_initialize_worker() . Turning on the pool concurrency check shows this happening: pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur 7f25a21fc700 in use by 7f2598bed700, state in use -> in use 172256 . Thawed libc:gsignal (+0x37) <--- abort 1 libc:gsignal (+0x37) 2 libc:abort (+0x143) 3 libapr-1:\apr_pools\pool_concurrency_abort 768 4 libapr-1:\apr_pools\apr_palloc 778 (+0xA) 5 libapr-1:\thread_cond\apr_thread_cond_create 44 6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9) 7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F) 8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE) 9 httpd:UNKNOWN at 0x004543A0 Here's the thead the diagnostic said had the pool 'in use' when apr_palloc() was called: 172271Thawed libc:__GI_strncmp (+0x1680) <-- in use 1 libc:__GI_strncmp (+0x1680) 2 libc:getenv (+0xBD) 3 libc:__nscd_getai (+0x3D3) 4 libc:gaih_inet.constprop.8 (+0x15F2) 5 libc:getaddrinfo (+0x11F) 6 libapr-1:\sockaddr\call_resolver 397 (+0x10) 7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15) 8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D) 9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18) 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16) 11 httpd:UNKNOWN at 0x004543A0 Another possible fix would be to use the global mutex everywhere it's needed and remove the call to create and use the worker thread lock altogether, but that would mean changing the code that currently uses the worker proxy thread lock and it seemed cleaner to just change proxy_util.c rather than two source modules. The two functions in ap_proxy_acquire_connection() that I serialized were found from other runs. Both apr_reslist_acquire() and connection_constructor() end up using the same worker pool that is being used by ap_proxy_determine_connection(). e.g. pool concurrency check: pool 0x1d93198(proxy_worker_cp), thread cur 7f019bfff700 in use by 7f019abfd700, state in use -> in use aborting thread: 1Thread 0x7f019bfff700 (LWP 106622) 0x7f01a735c207 in raise () from /lib64/libc.so.6 3 libapr-1:\apr_pools\pool_concurrency_abort 768 4 libapr-1:\apr_pools\apr_palloc 778 (+0xA) 5 libaprutil-1:\apr_reslist\get_container 100 (+0x8) 6 libaprutil-1:\apr_reslist\apr_reslist_acquire 121 (+0x3) 7 mod_proxy:\proxy_util\ap_proxy_acquire_connection 2327 (+0x3) 8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1933 (+0x19) 9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18) 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16) 11 httpd:UNKNOWN at 0x004543A0 And the "in-use" thread: 5Thread 0x7f019abfd700 (LWP 106624) 0x7f01a7424bed in connect () from /lib64/libc.so.6 8 mod_proxy:\proxy_util\ap_proxy_determine_connection 2537 (+0x15) 9 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D) 10 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18) 11 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16) Once the workers are up and running, they use thread-specific storage, but all the code that starts them up uses a pool that needs to be serialized. worker->cp as used in ap_proxy_acquire_connection() is currently not serialized when running with the worker mpm. It can run at the same time as ap_proxy_initialize_worker() with the same pool used in both. The problem doesn't show up when running the same test with pre-fork or event mpms. -- Don Poitras - Host RSAS Institute Inc.SAS Campus Drive mailto:sas...@sas.com (919)531-5637 Fax:677- Cary, NC 27513
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
On Sun, Aug 11, 2019 at 9:53 AM Don Poitras wrote: > > Hello, > I see that proxy_util.c was updated a few days ago. Would it be considered > a faux-pas to contact the developer directly to request a review of this > patch? It's been almost 2 months and I'm worried it will just get lost in the > noise. Probably not so effective but following up here is OK. I am a bit curious if what's really happening on your system is a failure in an existing mutex that goes unreported, because it seems that the new use of the thread mutex is redundant due to either the global mutex that's held or the mutex used internally by apr_reslist_*(). IIUC the unlocked access to the worker->cp in e.g. ap_proxy_acquire_connection() is a path only used in single-threaded MPMS. If you still have your testbed, it would be interesting to see which lock is necessary and if any of the existing mutexes are silently failing. -- Eric Covener cove...@gmail.com
RE: [PATCH 63503] - Reverse proxy server - SIGSEGV
Hello, I see that proxy_util.c was updated a few days ago. Would it be considered a faux-pas to contact the developer directly to request a review of this patch? It's been almost 2 months and I'm worried it will just get lost in the noise. -- Don Poitras - Host RSAS Institute Inc.SAS Campus Drive mailto:sas...@sas.com (919)531-5637 Fax:677- Cary, NC 27513
[PATCH 63503] - Reverse proxy server - SIGSEGV
Hello. I submitted this patch last week and I'd just like to solicit comments on how to proceed. See this for the text of the bug and fix: https://bz.apache.org/bugzilla/show_bug.cgi?id=63503 Unfortunately, I don't have a test that exhibits the bug that I can ask others to run. We were only able to see it when running a Loadrunner script emulating thousands of concurrent users. As I see it, there are two potential problems in the fix: 1) adding mutex() calls will single-thread code that previously was able to run on multiple processors and so imposes a performance hit. 2) There's a possibility of "deadly embrace" if some code path uses the locks in an order not seen in our tests. I think #2 would become evident pretty quickly if we were able to get a test run that exercises the mod_proxy code more fully than our one server. #1 I think is just overhead required for thread-safety. It would be nice if all apr-pools were required to only be accessed by a single thread, but if they aren't, then they must be serialized externally when there is the possibility of concurrent access. The presence of the "--enable-pool-concurrency-check" made fixing this problem a lot easier. I couldn't find a previous report of its use, but obviously someone thought it would be helpful. Let me know if there are questions I can answer about the patch or the core files that the bug produced. Thanks! -- Don Poitras - Host RSAS Institute Inc.SAS Campus Drive mailto:sas...@sas.com (919)531-5637 Fax:677- Cary, NC 27513