Also ignore lines starting with "#", ":", " " (space), or ";".
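
Something like this rough Python sketch would cover that prefix rule. The names SKIP_PREFIXES and drop_prefixed_lines are made up for illustration, and it assumes the wikitext has already been split into lines:

```python
# Line prefixes to skip: numbered lists, indentation, preformatted text,
# and definition-list terms.
SKIP_PREFIXES = ("#", ":", " ", ";")


def drop_prefixed_lines(lines):
    """Return only the lines that do not start with one of SKIP_PREFIXES."""
    return [line for line in lines if not line.startswith(SKIP_PREFIXES)]
```

Adding "=" to the tuple would also skip headings, if that is wanted.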

Then there are (potentially nested) tables, which start with a line
beginning with "{|" and end in a line beginning with "|}".

There are more "magic words" with the general pattern
"__SOMEUPPERCASECHARACTERS__", IIRC.
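
If that pattern holds, a regular expression should catch them all without keeping a list. A minimal sketch, assuming magic words consist only of uppercase ASCII letters between double underscores (strip_magic_words is just an illustrative name):

```python
import re

# Assumption: a magic word is two underscores, uppercase letters, two underscores.
MAGIC_WORD = re.compile(r"__[A-Z]+__")


def strip_magic_words(text):
    """Remove every __UPPERCASE__ token from the text."""
    return MAGIC_WORD.sub("", text)
```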

Note that people sometimes start the next paragraph on the same line as
a closing marker that should stand alone, such as
|} First line of text
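
For the table rule plus this caveat, a depth counter is probably enough. This is only a sketch under the assumption that "{|" and "|}" always appear at the very start of a line; drop_tables is a hypothetical helper name:

```python
def drop_tables(lines):
    """Skip everything inside (possibly nested) {| ... |} tables.

    Text that follows the outermost closing "|}" on the same line is kept.
    """
    depth = 0
    kept = []
    for line in lines:
        if line.startswith("{|"):
            depth += 1
        elif line.startswith("|}") and depth > 0:
            depth -= 1
            rest = line[2:].strip()
            # Keep a paragraph that starts right after the last closing marker.
            if depth == 0 and rest:
                kept.append(rest)
        elif depth == 0:
            kept.append(line)
    return kept
```

A stray "|}" with no open table is left in the output untouched, since it is not clear what it would mean.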

[[ and ]] pairs should not extend over a line, but they can be nested,
e.g. for images.
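
Because of the nesting, a plain regex won't do here, but a small scanner that counts depth within one line should. A rough sketch, assuming the whole [[...]] span is to be dropped (strip_links is an illustrative name, not anything from MediaWiki):

```python
def strip_links(line):
    """Remove balanced, possibly nested [[...]] spans from a single line."""
    out = []
    depth = 0
    i = 0
    while i < len(line):
        pair = line[i:i + 2]
        if pair == "[[":
            depth += 1
            i += 2
        elif pair == "]]" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Only copy characters that are outside every [[...]] span.
            if depth == 0:
                out.append(line[i])
            i += 1
    return "".join(out)
```

This throws away the link text as well, so for ordinary article links you would probably want to keep the label instead.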

Oh, and there are HTML comments to remove, and <nowiki>...</nowiki>
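
Both can span several lines, so they are easier to handle on the whole text before splitting it into lines. A minimal sketch, assuming <nowiki> sections are simply dropped rather than kept as literal text (the function name is made up):

```python
import re

# Non-greedy matches so that two separate comments are not merged into one.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
NOWIKI = re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL | re.IGNORECASE)


def strip_comments_and_nowiki(text):
    """Remove HTML comments first, then <nowiki>...</nowiki> sections."""
    text = COMMENT.sub("", text)
    return NOWIKI.sub("", text)
```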

That's all I can come up with right now...

Magnus



On Fri, Aug 6, 2010 at 4:07 PM, nevio carlos de alarcão
<[email protected]> wrote:
> If you only need to extract the first paragraph of Wikipedia articles, no problem.
>
> 2010/8/6 Katharina Wolkwitz <[email protected]>
>
>> Hi,
>>
>> On 05.08.2010 16:47, lmhelp2 wrote:
>> >
>> > Thank you!
>> >
>> > So here is the list I have for the moment:
>> > I need to ignore lines:
>> > - containing: {{...}}
>> >           => possibly spreading over several lines,
>> >           => being possibly nested {{... {{ ... }} ... }}.
>> > - containing: [[...]]
>> >           => being possibly nested [[... [[ ... ]] ... ]].
>> > - equal to: __TOC__
>> > - equal to: __NOTOC__
>> > - beginning with the '=' character
>> > - beginning with the '*' character
>> I don't think you should ignore lines beginning with the '*' character -
>> those may include the wanted first paragraph of the text as the '*' is just
>> a way of formatting the page...
>>
>> Greetings
>> Katharina
>>
>> _______________________________________________
>> MediaWiki-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>>
>
>
>
> --
> {+}Nevinho
> Come join the Movimento Colaborativo http://sextapoetica.com.br !!
>
