On Wed, 13 Jul 2022 at 18:06, Yann Ylavic <ylavic....@gmail.com> wrote:

> On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov <i...@apache.org> wrote:
> >
> > On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic....@gmail.com> wrote:
> > >
> > > Let me explain (maybe), I thought that PCRE2's match_data were not
> > > only containing the captures vector but also all the frames needed for
> > > matching not-too-complex regexes (at least the initial ones since they
> > > will be reallocated on need). So the (supposed) 2KB looked necessary
> > > to me for a few stack frames.
> > > But the actual 256 bytes only made me look at the PCRE2 code again
> > > (better this time) for how that can possibly work, and guess what: the
> > > initial frames are allocated on the stack ([1]) still in PCRE2, and
> > > that's 20KiB ([2]) of stack space for each call to pcre2_match().
> > >
> > > On Alpine Linux for instance (quite used in docker envs) there is only
> > > 128KiB stack per thread, so I've always configured
> > > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> > > vector cache, with custom pcre_malloc/free() to avoid performances
> > > overhead and be able to run more complex regexes without exploding,
> > > and I can't do that now with PCRE2 and without requiring 20KB of stack
> > > :/
> > > So I created a pull request ([3]).
> > >
> > The 20KB stack usage in PCRE is interesting, but somewhat unrelated. But
> > I would suggest a different solution: it would be nice if PCRE provided
> > a compile time directive to specify the maximum stack allocated buffer.
> > So then distributions like Alpine Linux would specify more appropriate
> > defaults.
>
> This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
> solution indeed.
> Say START_FRAMES_SIZE is set to some minimum, a single struct
> heapframe with 24 captures, that's (on a 64bit system):
>   offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
> ~= 512 bytes
> which would be much appreciated, but would also put much more pressure
> on util_regex's private_malloc().
> Will ap_regexec() start to use malloc() quite often with the current
> implementation then?
> Up to the old/default START_FRAMES_SIZE if it was a good default afterall?
>
> But we are not here, because I don't think distros will change the
> default pcre START_FRAMES_SIZE for all possible code they run, so most
> httpd users (which run on distros I suppose) will continue to consume
> 20K of stack memory for every ap_regexec() call.
>
> The flexible way for pcre2 users (and httpd maintainers) would be to
> let them opt-out (at runtime) of stack allocation from pcre2_match(),
> which is the patch I proposed.
>
> I see several problems in this approach:
1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending
of the platform.
2. The implementation may introduce unbounded memory usage, if PCRE2
performs multiple malloc/free during pattern matching. It seems that
currently it doesn't, but we cannot rely on the implementation details.
Stackbuf code doesn't have this problem, because it's limited to a
fixed-size buffer.
3. As I said before, the proposed approach effectively increases per thread
memory usage: 4 or 8 kb for match data stored in thread state + stack size.
Which is something I don't really understand, because if we are fine with
that, then we can just increase the thread stack size.

With the above in mind, I'm -1 for this approach and r1902731.


>
> > Creating match context with 20KB and caching it in thread local storage
> > doesn't look like an appropriate solution for me, because it is just an
> > increase of the per-thread memory overhead: the effective per-thread
> memory
> > usage will be 128 KB for stack plus 20 KB for match context.
>
> With no cache the other solutions are a big stack or a lot of
> malloc()s, for some regexes (not so unreasonable in terms of captures
> by the way). For instance mod_security might run some of them.
> Not only can each pcre frame be big because of the number of captures,
> but also the number of frames (not necessarily big ones) needed to run
> a regex can be big because of the regex complexity (mod_security might
> run some of them). Whatever the frame size is, pcre2 needs a new one
> each time it "recurses" (though it jumps nowadays it seems, didn't
> look at the jit code).
>
> > >
> > > > 3. access to tls storage requires hash lookup.
> > >
> > > Hmm no, util_pcre uses no hash lookup for instance
> > > If one wants to handle TLS data with apr_thread_data_get(&data,
> > > ap_current_thread()) then they will be stored in an apr_hash_t, but
> > > there's nothing implicit. And by the way, using thread_data hashtable
> > > won't be less efficient than registering as much native thread data
> > > (e.g. pthread_key_t), I already mentioned that in another thread.
> > >
> > [[[
> >     apr_thread_data_get((void **)&tls, "apreg", current);
> >     if (!tls || tls->size < size) {
> >         apr_pool_t *tp = apr_thread_pool_get(current);
> >         if (!tls) {
> >             tls = apr_pcalloc(tp, sizeof(*tls));
> > #ifdef HAVE_PCRE2
> >             apr_thread_data_set(tls, "apreg", apreg_tls_cleanup,
> current);
> > #else
> >             apr_thread_data_set(tls, "apreg", NULL, current);
> > #endif
> >         }
> >
> > ]]]
> >
> > apr_thread_data_get uses apr_pool_data_get() which uses apr hash.
> []
> >
> > Also, considering the big picture: while personally I appreciate the
> > development effort, we are talking about a large and complex change that
> > spans over multiple platforms and projects. It has the development cost,
> > but also an implicit cost of maintaining and testing it. Considering
> that,
> > I would say that the stackbuf approach is significantly simpler, local
> > and has much less overall costs while providing similar characteristics.
>
> I don't deny that you're implementation is superior when
> !AP_HAS_THREAD_LOCAL || !ap_current_thread(), but which case is that?
> A real case?
> The httpd-2.4.54 implementation is optimized for the other case, the
> most common one (i.e. platforms where usually performances count),
> that's a choice.
> AFAICT, the httpd-2.4.54 implementation is not performing worse than
> the new trunk one in any case when ap_current_thread() != NULL.
> So this is/was not so bad?
>
> Anyway, please have a look at my proposed patch (elsethread) which
> takes the best of both worlds.
> This old vs new implementation discussion is kind of moot with this no?
> I fixed things and tested it since posted (don't look too much in the
> details this v0), comments on the approach welcome though.
>
>
> Regards;
> Yann.
>


-- 
Ivan Zhakov

Reply via email to