Hi Platonides,

Thanks so much for your reply.
That makes a lot more sense. Unfortunately, I can't find section names as elements in the XML schema (https://www.mediawiki.org/xml/export-0.10.xsd). Do you have any recommendations for parsing the intro section out of the XML dumps? I'm trying to avoid parsing HTML or querying the API because I have Cloud9's wiki XML reader for processing the XML dumps in Spark.

Thanks again,
Dan

On Fri, Oct 12, 2018 at 1:00 PM Platonides <platoni...@gmail.com> wrote:

> That \1\2 are literal bytes. You would do:
>
> $regexp = '/^(.*?)(?=\x01\x02)/s';
>
> But those bytes are not present in the original wikitext, they are set
> by ExtractFormatter
> $html = preg_replace( '/\s*(<h([1-6])\b)/i',
>     "\n\n" .
>     self::SECTION_MARKER_START . '$2' .
>     self::SECTION_MARKER_END . '$1',
>     $html);
>
> Best regards
>
>
> PS: These are section names, not edit summaries.
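
For reference, the marker approach described in the quoted reply could be sketched in Python roughly as follows. This is only an illustration of the technique, not the extension's actual code: the function names mark_sections and first_section are hypothetical, the start marker is taken from the "\1\2" bytes mentioned above, and the end marker value is an assumption since it is not stated in this thread.

```python
import re

# Start marker: the literal "\1\2" bytes mentioned in the quoted reply.
SECTION_MARKER_START = "\x01\x02"
# End marker: assumed value for illustration; not stated in this thread.
SECTION_MARKER_END = "\x02\x01"


def mark_sections(html: str) -> str:
    """Insert marker bytes and the heading level before each <h1>..<h6> tag."""
    return re.sub(
        r"\s*(<h([1-6])\b)",
        # group(2) is the heading level, group(1) is the opening of the tag.
        lambda m: "\n\n" + SECTION_MARKER_START + m.group(2)
        + SECTION_MARKER_END + m.group(1),
        html,
        flags=re.IGNORECASE,
    )


def first_section(marked: str) -> str:
    """Return everything before the first start marker (the intro)."""
    pattern = r"(.*?)(?=" + re.escape(SECTION_MARKER_START) + r")"
    match = re.match(pattern, marked, flags=re.DOTALL)
    # Design choice of this sketch: with no marker, treat the whole text as intro.
    return match.group(1) if match else marked


if __name__ == "__main__":
    html = "<p>Intro text.</p>\n<h2>History</h2><p>More.</p>"
    print(first_section(mark_sections(html)).strip())
```

Note that this only works on rendered HTML that has been through the marking step; as the reply says, the marker bytes are not present in the raw wikitext stored in the dumps.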
_______________________________________________
Mediawiki-api mailing list
Mediawiki-api@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api