On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic....@gmail.com> wrote:
>
> On Fri, Jul 8, 2022 at 7:15 PM Ivan Zhakov <i...@visualsvn.com> wrote:
> >
> > On Fri, 8 Jul 2022 at 19:58, Yann Ylavic <ylavic....@gmail.com> wrote:
> > >
> > > The main downside is:
> > >
> > > > +#ifndef AP_PCRE_STACKBUF_SIZE
> > > > +#define AP_PCRE_STACKBUF_SIZE (256)
> > > >  #endif
> > > []
> > > >  AP_DECLARE(int) ap_regexec(const ap_regex_t *preg, const char *string,
> > > >                             apr_size_t nmatch, ap_regmatch_t *pmatch,
> > > >                             int eflags)
> > > > -    int small_vector[POSIX_MALLOC_THRESHOLD * 3];
> > > > +    /* Use apr_uint64_t to get proper alignment. */
> > > > +    apr_uint64_t stack_buf[(AP_PCRE_STACKBUF_SIZE + 
> > > > sizeof(apr_uint64_t) - 1) / sizeof(apr_uint64_t)];
> > >
> > > So, if I count correctly, that's >2KB stack usage (regardless of PCRE1
> > > or PCRE2) compared to the 120B we had previously (for PCRE1 only)?
> > >
> > AP_PCRE_STACKBUF_SIZE is in bytes. And by default is 256 bytes. So
> > code allocates exactly 256 bytes on stack.
>
> Argh, so I can't count either.
>
> Let me explain (maybe), I thought that PCRE2's match_data were not
> only containing the captures vector but also all the frames needed for
> matching not-too-complex regexes (at least the initial ones since they
> will be reallocated on need). So the (supposed) 2KB looked necessary
> to me for a few stack frames.
> But the actual 256 bytes only made me look at the PCRE2 code again
> (better this time) for how that can possibly work, and guess what: the
> initial frames are allocated on the stack ([1]) still in PCRE2, and
> that's 20KiB ([2]) of stack space for each call to pcre2_match().
>
> On Alpine Linux for instance (quite used in docker envs) there is only
> 128KiB stack per thread, so I've always configured
> --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> vector cache, with custom pcre_malloc/free() to avoid performances
> overhead and be able to run more complex regexes without exploding,
> and I can't do that now with PCRE2 and without requiring 20KB of stack
> :/
> So I created a pull request ([3]).
>
The 20 KB stack usage in PCRE is interesting, but somewhat unrelated. I
would suggest a different solution: it would be nice if PCRE provided a
compile-time directive to specify the maximum stack-allocated buffer size.
Distributions like Alpine Linux could then pick more appropriate
defaults.

Creating a match context with 20 KB and caching it in thread-local storage
doesn't look like an appropriate solution to me, because it just increases
the per-thread memory overhead: the effective per-thread memory usage
becomes 128 KB for the stack plus 20 KB for the match context.

> > >
> > > Finally, I don't see how it's better than one allocation (or two) for
> > > the whole lifetime of a thread, given that thread_local is available
> > > on all the platforms (where performances matter), neither do I see
> > > what are your grievances against thread_local..
> > >
> > 1. tls storage is growing to maximum captures count. So it requires
> > more than one or two allocations.
>
> For the lifetime of a thread and the number of ap_regexec() it may
> call, that's all negligible.
>
> > 2. tls storage doesn't support dynamic MPM that uses native threads
>
> httpd-2.4 (and APR 1.8 unless reverted) provide the way to apr-ize a
> native thread so that ap_thread_current() is available.

I meant native OS thread pools.

>
> > 3. access to tls storage requires hash lookup.
>
> Hmm no, util_pcre uses no hash lookup for instance
> If one wants to handle TLS data with apr_thread_data_get(&data,
> ap_current_thread()) then they will be stored in an apr_hash_t, but
> there's nothing implicit. And by the way, using thread_data hashtable
> won't be less efficient than registering as much native thread data
> (e.g. pthread_key_t), I already mentioned that in another thread.
>
[[[
    apr_thread_data_get((void **)&tls, "apreg", current);
    if (!tls || tls->size < size) {
        apr_pool_t *tp = apr_thread_pool_get(current);
        if (!tls) {
            tls = apr_pcalloc(tp, sizeof(*tls));
#ifdef HAVE_PCRE2
            apr_thread_data_set(tls, "apreg", apreg_tls_cleanup, current);
#else
            apr_thread_data_set(tls, "apreg", NULL, current);
#endif
        }

]]]

apr_thread_data_get() uses apr_pool_userdata_get(), which performs a
lookup in an apr_hash_t.

> > 4. Windows build is currently broken due to tls storage API changes.
>
> Some Windows dev to fix the trivial compilation error (which by the
> way passed the 2.4.54 release process)?
> I'm not, though I am trying to update the code of all platforms when 
> possible..
>
There are more problems than a trivial compilation error: thread stack
memory is now committed, whereas before these changes it was only reserved.

Also, considering the big picture: while I personally appreciate the
development effort, we are talking about a large and complex change that
spans multiple platforms and projects. It has a development cost, but also
the implicit cost of maintaining and testing it. With that in mind, I would
say the stackbuf approach is significantly simpler and more local, and has
a much lower overall cost while providing similar characteristics.

> > 5. Using tls reserves the memory, even if pcre is not being used
> > afterwards. This can be thought as of implicitly increasing the
> > per-thread stack size.
>
> Which memory is reserved exactly for TLS?
The memory reserved for the PCRE match context.

>
> > 6. With new code there will be no actual allocations in real-world cases.
>
> Well, there will be no allocation but the 20KB stack space, though
> that's true for the thread_local implementation too.
> Now if PCRE2 stops using stack frames by allowing users to reserve
> that space in match_data (like I propose in [3]), the thread_local
> implementation clearly wins thanks to the ~almost~ single allocation
> per thread, while your implementation would either use 20KB of stack
> or malloc()ate them for every pcre2_match() call.
> That'd be great if PCRE2 would allow for that, not to show that my
> implementation is better but because 20KB of stack for a single
> function in a lib is just insane (IMHO).
>
As per above, this change effectively increases the per-thread memory
usage by 20 KB. For the same effect we could simply increase the thread
stack size.

---
Ivan Zhakov
