2010/1/27 aditya srinivas <[email protected]>:
> Hello,
> I am writing a Java program to extract the abstract of a Wikipedia page
> given the title of the page. I have done some research and found
> out that the abstract will be in rvsection=0.
> So, for example, if I want the abstract of the 'Eiffel Tower' wiki page, I am
> querying the API in the following way:
> http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Eiffel%20Tower&rvprop=content&rvsection=0
> I then parse the XML data that comes back and take the wikitext in the tag <rev
> xml:space="preserve">, which represents the abstract of the Wikipedia page.
> But this wikitext also contains the infobox data, which I do not need. I
> would like to know if there is any way in which I can remove the infobox data
> and get only the wikitext related to the page's abstract, or if there is any
> alternative method by which I can get the abstract of the page directly.

The software doesn't know what the abstract is; it just gives you
everything up to the first == Header ==. You can try stripping out the
infobox by removing everything between each {{ and its matching }}
(finding the matching }} is the tricky part, because templates can be nested).
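
A rough sketch of that brace matching in Java, in case it helps. The class
and method names (TemplateStripper, stripTemplates) are made up for
illustration and are not part of MediaWiki or any library. It assumes the
braces in the section are balanced, and it does not handle {{{parameter}}}
triple braces, <nowiki> blocks, or HTML comments.

```java
// Hypothetical helper, not part of MediaWiki: removes {{...}} templates
// (such as infoboxes) from a wikitext string by tracking nesting depth.
final class TemplateStripper {

    /** Returns the wikitext with every balanced {{...}} span removed. */
    static String stripTemplates(String wikitext) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // how many {{ are currently open
        int i = 0;
        while (i < wikitext.length()) {
            if (wikitext.startsWith("{{", i)) {
                depth++;          // entering a (possibly nested) template
                i += 2;
            } else if (depth > 0 && wikitext.startsWith("}}", i)) {
                depth--;          // leaving the innermost open template
                i += 2;
            } else {
                if (depth == 0) { // keep only text outside all templates
                    out.append(wikitext.charAt(i));
                }
                i++;
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String section0 =
            "{{Infobox building|name=Eiffel Tower|height={{convert|324|m}}}}\n"
            + "The '''Eiffel Tower''' is a tower in Paris.";
        System.out.println(stripTemplates(section0));
    }
}
```

The depth counter is what deals with the "matching" problem: a }} only
ends the infobox once every {{ opened inside it has been closed. What is
left is still wikitext, so bold markers like ''' and [[links]] remain and
need separate handling.
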
Roan Kattouw (Catrope)

_______________________________________________
Mediawiki-api mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
