Hi Platonides,

Thanks so much for your reply.

That makes a lot more sense. Unfortunately, I can't find section names
as elements in the XML schema
(https://www.mediawiki.org/xml/export-0.10.xsd). Do you have any
recommendations for parsing the intro section out of the XML dumps? I'm
trying to avoid parsing HTML or querying the API because I have Cloud9's
wiki XML reader for processing the XML dumps in Spark.

Thanks again,
Dan

On Fri, Oct 12, 2018 at 1:00 PM Platonides <platoni...@gmail.com> wrote:

> Those \1\2 are literal bytes. You would do:
>
> $regexp = '/^(.*?)(?=\x01\x02)/s';
>
> But those bytes are not present in the original wikitext; they are set
> by ExtractFormatter:
>
>     $html = preg_replace( '/\s*(<h([1-6])\b)/i',
>         "\n\n" . self::SECTION_MARKER_START . '$2' .
>             self::SECTION_MARKER_END . '$1',
>         $html );
>
> Best regards
>
>
> PS: These are section names, not edit summaries.
>
> _______________________________________________
> Mediawiki-api mailing list
> Mediawiki-api@lists.wikimedia.org
>
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
>
_______________________________________________
Mediawiki-api mailing list
Mediawiki-api@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
