On Wed, 13 Jul 2022 at 18:06, Yann Ylavic <ylavic....@gmail.com> wrote:
> On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov <i...@apache.org> wrote: > > > > On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic....@gmail.com> wrote: > > > > > > Let me explain (maybe), I thought that PCRE2's match_data were not > > > only containing the captures vector but also all the frames needed for > > > matching not-too-complex regexes (at least the initial ones since they > > > will be reallocated on need). So the (supposed) 2KB looked necessary > > > to me for a few stack frames. > > > But the actual 256 bytes only made me look at the PCRE2 code again > > > (better this time) for how that can possibly work, and guess what: the > > > initial frames are allocated on the stack ([1]) still in PCRE2, and > > > that's 20KiB ([2]) of stack space for each call to pcre2_match(). > > > > > > On Alpine Linux for instance (quite used in docker envs) there is only > > > 128KiB stack per thread, so I've always configured > > > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool) > > > vector cache, with custom pcre_malloc/free() to avoid performances > > > overhead and be able to run more complex regexes without exploding, > > > and I can't do that now with PCRE2 and without requiring 20KB of stack > > > :/ > > > So I created a pull request ([3]). > > > > > The 20KB stack usage in PCRE is interesting, but somewhat unrelated. But > > I would suggest a different solution: it would be nice if PCRE provided > > a compile time directive to specify the maximum stack allocated buffer. > > So then distributions like Alpine Linux would specify more appropriate > > defaults. > > This is what pcre2 folks proposed in [3] as a fast(ly acceptable) > solution indeed. > Say START_FRAMES_SIZE is set to some minimum, a single struct > heapframe with 24 captures, that's (on a 64bit system): > offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384 > ~= 512 bytes > which would be much appreciated, but would also put much more pressure > on util_regex's private_malloc(). > Will ap_regexec() start to use malloc() quite often with the current > implementation then? > Up to the old/default START_FRAMES_SIZE if it was a good default afterall? > > But we are not here, because I don't think distros will change the > default pcre START_FRAMES_SIZE for all possible code they run, so most > httpd users (which run on distros I suppose) will continue to consume > 20K of stack memory for every ap_regexec() call. > > The flexible way for pcre2 users (and httpd maintainers) would be to > let them opt-out (at runtime) of stack allocation from pcre2_match(), > which is the patch I proposed. > > I see several problems in this approach: 1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending of the platform. 2. The implementation may introduce unbounded memory usage, if PCRE2 performs multiple malloc/free during pattern matching. It seems that currently it doesn't, but we cannot rely on the implementation details. Stackbuf code doesn't have this problem, because it's limited to a fixed-size buffer. 3. As I said before, the proposed approach effectively increases per thread memory usage: 4 or 8 kb for match data stored in thread state + stack size. Which is something I don't really understand, because if we are fine with that, then we can just increase the thread stack size. With the above in mind, I'm -1 for this approach and r1902731. > > > Creating match context with 20KB and caching it in thread local storage > > doesn't look like an appropriate solution for me, because it is just an > > increase of the per-thread memory overhead: the effective per-thread > memory > > usage will be 128 KB for stack plus 20 KB for match context. > > With no cache the other solutions are a big stack or a lot of > malloc()s, for some regexes (not so unreasonable in terms of captures > by the way). For instance mod_security might run some of them. > Not only can each pcre frame be big because of the number of captures, > but also the number of frames (not necessarily big ones) needed to run > a regex can be big because of the regex complexity (mod_security might > run some of them). Whatever the frame size is, pcre2 needs a new one > each time it "recurses" (though it jumps nowadays it seems, didn't > look at the jit code). > > > > > > > > 3. access to tls storage requires hash lookup. > > > > > > Hmm no, util_pcre uses no hash lookup for instance > > > If one wants to handle TLS data with apr_thread_data_get(&data, > > > ap_current_thread()) then they will be stored in an apr_hash_t, but > > > there's nothing implicit. And by the way, using thread_data hashtable > > > won't be less efficient than registering as much native thread data > > > (e.g. pthread_key_t), I already mentioned that in another thread. > > > > > [[[ > > apr_thread_data_get((void **)&tls, "apreg", current); > > if (!tls || tls->size < size) { > > apr_pool_t *tp = apr_thread_pool_get(current); > > if (!tls) { > > tls = apr_pcalloc(tp, sizeof(*tls)); > > #ifdef HAVE_PCRE2 > > apr_thread_data_set(tls, "apreg", apreg_tls_cleanup, > current); > > #else > > apr_thread_data_set(tls, "apreg", NULL, current); > > #endif > > } > > > > ]]] > > > > apr_thread_data_get uses apr_pool_data_get() which uses apr hash. > [] > > > > Also, considering the big picture: while personally I appreciate the > > development effort, we are talking about a large and complex change that > > spans over multiple platforms and projects. It has the development cost, > > but also an implicit cost of maintaining and testing it. Considering > that, > > I would say that the stackbuf approach is significantly simpler, local > > and has much less overall costs while providing similar characteristics. > > I don't deny that you're implementation is superior when > !AP_HAS_THREAD_LOCAL || !ap_current_thread(), but which case is that? > A real case? > The httpd-2.4.54 implementation is optimized for the other case, the > most common one (i.e. platforms where usually performances count), > that's a choice. > AFAICT, the httpd-2.4.54 implementation is not performing worse than > the new trunk one in any case when ap_current_thread() != NULL. > So this is/was not so bad? > > Anyway, please have a look at my proposed patch (elsethread) which > takes the best of both worlds. > This old vs new implementation discussion is kind of moot with this no? > I fixed things and tested it since posted (don't look too much in the > details this v0), comments on the approach welcome though. > > > Regards; > Yann. > -- Ivan Zhakov