On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov <i...@apache.org> wrote:
>
> On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic....@gmail.com> wrote:
> >
> > Let me explain (maybe), I thought that PCRE2's match_data were not
> > only containing the captures vector but also all the frames needed for
> > matching not-too-complex regexes (at least the initial ones since they
> > will be reallocated on need). So the (supposed) 2KB looked necessary
> > to me for a few stack frames.
> > But the actual 256 bytes only made me look at the PCRE2 code again
> > (better this time) for how that can possibly work, and guess what: the
> > initial frames are allocated on the stack ([1]) still in PCRE2, and
> > that's 20KiB ([2]) of stack space for each call to pcre2_match().
> >
> > On Alpine Linux for instance (quite used in docker envs) there is only
> > 128KiB stack per thread, so I've always configured
> > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> > vector cache, with custom pcre_malloc/free() to avoid performances
> > overhead and be able to run more complex regexes without exploding,
> > and I can't do that now with PCRE2 and without requiring 20KB of stack
> > :/
> > So I created a pull request ([3]).
> >
> The 20KB stack usage in PCRE is interesting, but somewhat unrelated. But
> I would suggest a different solution: it would be nice if PCRE provided
> a compile time directive to specify the maximum stack allocated buffer.
> So then distributions like Alpine Linux would specify more appropriate
> defaults.

This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
solution indeed.
Say START_FRAMES_SIZE is set to some minimum, a single struct
heapframe with 24 captures, that's (on a 64bit system):
  offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
~= 512 bytes
which would be much appreciated, but would also put much more pressure
on util_regex's private_malloc().
Will ap_regexec() start to use malloc() quite often with the current
implementation then?
Up to the old/default START_FRAMES_SIZE if it was a good default afterall?

But we are not here, because I don't think distros will change the
default pcre START_FRAMES_SIZE for all possible code they run, so most
httpd users (which run on distros I suppose) will continue to consume
20K of stack memory for every ap_regexec() call.

The flexible way for pcre2 users (and httpd maintainers) would be to
let them opt-out (at runtime) of stack allocation from pcre2_match(),
which is the patch I proposed.

>
> Creating match context with 20KB and caching it in thread local storage
> doesn't look like an appropriate solution for me, because it is just an
> increase of the per-thread memory overhead: the effective per-thread memory
> usage will be 128 KB for stack plus 20 KB for match context.

With no cache the other solutions are a big stack or a lot of
malloc()s, for some regexes (not so unreasonable in terms of captures
by the way). For instance mod_security might run some of them.
Not only can each pcre frame be big because of the number of captures,
but also the number of frames (not necessarily big ones) needed to run
a regex can be big because of the regex complexity (mod_security might
run some of them). Whatever the frame size is, pcre2 needs a new one
each time it "recurses" (though it jumps nowadays it seems, didn't
look at the jit code).

> >
> > > 3. access to tls storage requires hash lookup.
> >
> > Hmm no, util_pcre uses no hash lookup for instance
> > If one wants to handle TLS data with apr_thread_data_get(&data,
> > ap_current_thread()) then they will be stored in an apr_hash_t, but
> > there's nothing implicit. And by the way, using thread_data hashtable
> > won't be less efficient than registering as much native thread data
> > (e.g. pthread_key_t), I already mentioned that in another thread.
> >
> [[[
>     apr_thread_data_get((void **)&tls, "apreg", current);
>     if (!tls || tls->size < size) {
>         apr_pool_t *tp = apr_thread_pool_get(current);
>         if (!tls) {
>             tls = apr_pcalloc(tp, sizeof(*tls));
> #ifdef HAVE_PCRE2
>             apr_thread_data_set(tls, "apreg", apreg_tls_cleanup, current);
> #else
>             apr_thread_data_set(tls, "apreg", NULL, current);
> #endif
>         }
>
> ]]]
>
> apr_thread_data_get uses apr_pool_data_get() which uses apr hash.
[]
>
> Also, considering the big picture: while personally I appreciate the
> development effort, we are talking about a large and complex change that
> spans over multiple platforms and projects. It has the development cost,
> but also an implicit cost of maintaining and testing it. Considering that,
> I would say that the stackbuf approach is significantly simpler, local
> and has much less overall costs while providing similar characteristics.

I don't deny that you're implementation is superior when
!AP_HAS_THREAD_LOCAL || !ap_current_thread(), but which case is that?
A real case?
The httpd-2.4.54 implementation is optimized for the other case, the
most common one (i.e. platforms where usually performances count),
that's a choice.
AFAICT, the httpd-2.4.54 implementation is not performing worse than
the new trunk one in any case when ap_current_thread() != NULL.
So this is/was not so bad?

Anyway, please have a look at my proposed patch (elsethread) which
takes the best of both worlds.
This old vs new implementation discussion is kind of moot with this no?
I fixed things and tested it since posted (don't look too much in the
details this v0), comments on the approach welcome though.


Regards;
Yann.

Reply via email to