On Wed, Mar 27, 2019 at 2:30 PM C. Scott Ananian <canan...@wikimedia.org>
wrote:

> Continuing this saga: I'm still having performance problems on character
> entity expansion.  Here's the baseline code:
> https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881
> Of note: the regular expression is quite large -- around 26kB -- because
> it needs to include the complete table of all HTML5 entities, which it gets
> from a separate file of tables, HTMLData.php.
>
> Recapping briefly: we established before that it is very important that
> large regex strings been interned, otherwise pcre_get_compiled_regex_cache
> needs to do a full zend_string_equal_content() on every call to
> preg_match*, and since the strings will match, that costs a complete
> traversal of the 26kB regexp string.
>
> If I inline the char ref table directly into the regexp as a single huge
> literal string, that string is interned and (with Nikita's recent fixes for
> the CLI) things are ok.
>
> But that's bad for code maintainability; it violates Do Not Repeat
> Yourself and now it's much harder to see what the character reference
> regexp is doing because it's got this huge 26k table embedded in the middle
> of it.
>
> PHP will let me initialize the string as:
>
> const CHAR_REF_REGEXP = ' ... ' . HTMLData::NAMED_ENTITY_REGEX . "...";
>
> that is, it recognizes this as a compile-time constant -- but it doesn't
> actually intern the resulting string.  The code in
> zend_declare_class_constant_ex interns *most* constant strings, but in this
> case because there is a reference to another constant, the Z_TYPE_P(value)
> == IS_STRING check in zend_declare_class_constant_ex fails (the actual type
> is IS_CONSTANT_AST) presumably because we don't want to autoload HTMLData
> too soon. (But this also seems to happen even if I use
> self::NAMED_ENTITY_REGEX here, which wouldn't trigger the autoloader.)
>
> I *think* the proper fix is to intern the string lazily when it is finally
> evaluated, in ZEND_FETCH_CLASS_CONSTANT_SPEC_CONST_CONST_HANDLER around the
> point where we check Z_TYPE_P(value) == IS_CONSTANT_AST -- probably by
> tweaking zval_update_constant_ex to intern any string result?
>

I've created https://github.com/php/php-src/pull/3994 implementing this
fix, and confirmed that it is sufficient to get my large regexp interned
when it is rewritten as a class constant referencing
HTMLData::NAMED_ENTITY_REGEX.
 --scott

Reply via email to