On Wed, Mar 27, 2019 at 2:30 PM C. Scott Ananian <canan...@wikimedia.org> wrote:
> Continuing this saga: I'm still having performance problems on character > entity expansion. Here's the baseline code: > https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L881 > Of note: the regular expression is quite large -- around 26kB -- because > it needs to include the complete table of all HTML5 entities, which it gets > from a separate file of tables, HTMLData.php. > > Recapping briefly: we established before that it is very important that > large regex strings been interned, otherwise pcre_get_compiled_regex_cache > needs to do a full zend_string_equal_content() on every call to > preg_match*, and since the strings will match, that costs a complete > traversal of the 26kB regexp string. > > If I inline the char ref table directly into the regexp as a single huge > literal string, that string is interned and (with Nikita's recent fixes for > the CLI) things are ok. > > But that's bad for code maintainability; it violates Do Not Repeat > Yourself and now it's much harder to see what the character reference > regexp is doing because it's got this huge 26k table embedded in the middle > of it. > > PHP will let me initialize the string as: > > const CHAR_REF_REGEXP = ' ... ' . HTMLData::NAMED_ENTITY_REGEX . "..."; > > that is, it recognizes this as a compile-time constant -- but it doesn't > actually intern the resulting string. The code in > zend_declare_class_constant_ex interns *most* constant strings, but in this > case because there is a reference to another constant, the Z_TYPE_P(value) > == IS_STRING check in zend_declare_class_constant_ex fails (the actual type > is IS_CONSTANT_AST) presumably because we don't want to autoload HTMLData > too soon. (But this also seems to happen even if I use > self::NAMED_ENTITY_REGEX here, which wouldn't trigger the autoloader.) > > I *think* the proper fix is to intern the string lazily when it is finally > evaluated, in ZEND_FETCH_CLASS_CONSTANT_SPEC_CONST_CONST_HANDLER around the > point where we check Z_TYPE_P(value) == IS_CONSTANT_AST -- probably by > tweaking zval_update_constant_ex to intern any string result? > I've created https://github.com/php/php-src/pull/3994 implementing this fix, and confirmed that it is sufficient to get my large regexp interned when it is rewritten as a class constant referencing HTMLData::NAMED_ENTITY_REGEX. --scott