Hi, I'm trying to create a dataset of summaries vs full text bodies for automatic text summarization models.
I was looking at the online API for retrieving the summary of a page, so that I could recreate it in my Spark code for parsing wiki dumps. Specifically, I was looking at the regex in:

https://phabricator.wikimedia.org/diffusion/ETEX/browse/master/includes/ApiQueryExtracts.php;012b89e966edf20834f0e551a66fbb4ebfd185cd$210

    $regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';

With the section marker start filled in:

    $regexp = '/^(.*?)(?=' . \1\2 . ')/s';

However, when I plug that expression into an online tester (regex101.com), I see this error:

    \2 This token references a non-existent or invalid subpattern

Is this a bug, or am I substituting the marker incorrectly?

The alternative branch is taken when plaintext is set to false. Is it correct that this branch is for parsing HTML, and so not applicable to the XML in wiki dumps?

Thanks for your help,
Dan Kramer
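P.S. For concreteness, here is a minimal Python sketch of what I am trying to reproduce. It rests on an assumption I have not confirmed: that SECTION_MARKER_START is a double-quoted PHP string, so that \1\2 are octal escapes for the two control characters U+0001 and U+0002 rather than regex backreferences. The names LEAD_RE and lead_section are my own, and falling back to the whole text when no marker is found is also my choice, not necessarily what the extension does.

```python
import re

# Assumption: the marker is the two control characters U+0001 U+0002,
# i.e. "\1\2" read as PHP double-quoted octal escapes, not backreferences.
SECTION_MARKER_START = "\x01\x02"

# Python counterpart of the PHP pattern '/^(.*?)(?=' . MARKER . ')/s'.
# re.DOTALL plays the role of the /s flag so that "." also matches newlines.
LEAD_RE = re.compile(
    r"^(.*?)(?=" + re.escape(SECTION_MARKER_START) + r")",
    re.DOTALL,
)


def lead_section(text: str) -> str:
    """Return the text before the first section marker.

    If no marker is present, return the text unchanged (hypothetical fallback).
    """
    match = LEAD_RE.search(text)
    return match.group(1) if match else text
```

If that reading is right, the error in the online tester would come from pasting the literal characters \1\2 into the pattern, where they are parsed as backreferences. This sketch would also only be useful for the dumps if my own preprocessing inserts the same marker before each section heading.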
_______________________________________________
Mediawiki-api mailing list
Mediawiki-api@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api