Re: [Mediawiki-api] Need to extract abstract of a wikipedia page

Roan Kattouw Wed, 27 Jan 2010 12:59:29 -0800

2010/1/27 aditya srinivas <[email protected]>:
> Hello,
> I am writing a Java program to extract the abstract of the wikipedia page
> given the title of the wikipedia page. I have done some research and found
> out that the abstract will be in rvsection=0.
> So, for example, if I want the abstract of the 'Eiffel Tower' wiki page, I am
> querying the API in the following way:
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Eiffel%20Tower&rvprop=content&rvsection=0
> I then parse the returned XML and take the wikitext in the tag <rev
> xml:space="preserve">, which represents the abstract of the Wikipedia page.
> But this wikitext also contains the infobox data, which I do not need. I
> would like to know if there is any way I can remove the infobox data
> and get only the wikitext for the page's abstract, or if there is an
> alternative method by which I can get the abstract of the page directly.
The software doesn't know what the abstract is; it just gives you
everything up to the first == Header ==. You can try stripping out the
infobox by removing everything between each {{ and its matching }}
(finding the matching }} is the tricky part, because templates nest).
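
A minimal sketch of that brace matching in Java, in case it helps. The
class and method names (TemplateStripper, stripTemplates) are made up for
illustration and are not part of MediaWiki or its API. It keeps a nesting
depth counter and only copies characters seen at depth 0. It assumes the
braces are balanced: an unclosed {{ swallows the rest of the text, and
triple-brace parameters like {{{1}}} are not handled specially.

```java
// Hypothetical helper, not part of MediaWiki: drops every {{...}} span
// (including nested ones) from the wikitext of section 0.
final class TemplateStripper {
    private TemplateStripper() {}

    static String stripTemplates(String wikitext) {
        StringBuilder out = new StringBuilder(wikitext.length());
        int depth = 0; // how many {{ are currently open
        int i = 0;
        while (i < wikitext.length()) {
            if (wikitext.startsWith("{{", i)) {
                // Entering a template (possibly nested inside another).
                depth++;
                i += 2;
            } else if (depth > 0 && wikitext.startsWith("}}", i)) {
                // Leaving the innermost open template.
                depth--;
                i += 2;
            } else {
                // Keep the character only when outside every template.
                if (depth == 0) {
                    out.append(wikitext.charAt(i));
                }
                i++;
            }
        }
        return out.toString();
    }
}
```

You would call TemplateStripper.stripTemplates(revText).trim() on the
content of the <rev> element. Note that this removes all templates in the
lead section, not only the infobox, so inline templates that produce
visible text are lost as well.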


Roan Kattouw (Catrope)

_______________________________________________
Mediawiki-api mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
