[Mediawiki-api] Question about summary regex in api for ML dataset

Daniel Kramer Fri, 12 Oct 2018 09:45:45 -0700

Hi,

I'm trying to create a dataset that pairs article summaries with their
full text bodies, for training automatic text summarization models.


I was looking at the online API for retrieving the summary of a page, so
that I could reproduce its behavior in my Spark code for parsing wiki
dumps. Specifically, I was looking at the regex in:
https://phabricator.wikimedia.org/diffusion/ETEX/browse/master/includes/ApiQueryExtracts.php;012b89e966edf20834f0e551a66fbb4ebfd185cd$210

$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';

With section marker start filled in:

$regexp = '/^(.*?)(?=' . \1\2 . ')/s';

However, when I plug that expression into an online tester (regex101.com),
it reports the following for \2: "This token references a non-existent or
invalid subpattern".

Is this a bug, or am I substituting the marker incorrectly?
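
For reference, here is a minimal Python sketch of what I am trying to
reproduce on my side. It rests on an assumption I have not confirmed against
the extension source: that SECTION_MARKER_START is a literal two-character
delimiter (the control characters PHP would produce if "\1\2" is a
double-quoted string) rather than a pair of regex backreferences. The
function name lead_section and the fallback for text with no marker are my
own choices, not taken from ApiQueryExtracts.php.

```python
import re

# Assumption: the marker is the two control characters that PHP yields for
# the double-quoted string "\1\2". Not verified against the extension source.
SECTION_MARKER_START = "\x01\x02"


def lead_section(text: str, marker: str = SECTION_MARKER_START) -> str:
    """Return everything before the first occurrence of `marker`."""
    # re.escape keeps the marker literal, so the regex engine never
    # interprets it as a backreference such as \1 or \2.
    pattern = re.compile(r"^(.*?)(?=" + re.escape(marker) + r")", re.DOTALL)
    match = pattern.search(text)
    # No marker present: return the whole text (my choice for this sketch).
    return match.group(1) if match else text


if __name__ == "__main__":
    sample = "Intro paragraph.\n" + SECTION_MARKER_START + "History\nBody text."
    print(repr(lead_section(sample)))
```

Passing a different marker string (for example "==" for raw wikitext
headings) would let me reuse the same function on dump text, if that turns
out to be the right delimiter there.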

The alternative branch runs when plaintext is set to false. Is that branch
only for parsing HTML, and therefore not applicable to the XML in wiki dumps?

Thanks for your help,
Dan Kramer

_______________________________________________
Mediawiki-api mailing list
Mediawiki-api@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
