Jeffrey Barish wrote:
> On Saturday 21 March 2009 16:15:54 Platonides wrote:
>> Jeffrey Barish wrote:
>>> I am writing a PyGTK application.  I would like to be able to download
>>> text only (with formatting) from Wikipedia and display it in my
>>> application.  I think that I am close to a solution, but I have reached
>>> an impasse due to my ignorance of most of the mediawiki API.
>>>
>>> My plan has been to use GtkMozembed in my application to render the page,
>>> so I need to retrieve html.  What is close to working is to use the
>>> index.php API with action=render and title=<search string for the
>>> Wikipedia page>.  The data that I retrieve does display in my browser,
>>> but it has the following undesired characteristics:
>>>
>>> 2. There are sections at the end that I don't want (Further reading,
>>> External links, Notes, See also, References).
>> Those sections are part of the content. The API doesn't have any
>> parameter to include/exclude them.
>>
>>> 1. All images appear (I want none).
>> Same issue. Although it's easier to replace, remove /<img.*?>/
>
> It seems that images appear in <div class="thumbcaption"></div> blocks.
> Would you advise using regular expressions to remove these blocks, or
> should I use something like BeautifulSoup to parse the page formally and
> then remove elements?

Only thumbnailed images appear in such blocks. You should really just
remove <img> tags if you want to get rid of images.
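
For illustration, a minimal sketch of the "just remove <img> tags" approach, assuming Python 3 and only the standard library. The helper name strip_images and the sample markup are hypothetical and are not taken from the thread. The pattern is a slightly stricter variant of the /<img.*?>/ expression suggested above.

```python
import re

# Matches any <img ...> tag, self-closing or not, case-insensitively.
# [^>]* also matches across newlines inside the tag, which .*? would not
# do without re.DOTALL.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(html):
    """Return html with every <img> tag removed (hypothetical helper)."""
    return IMG_TAG.sub("", html)


# Made-up sample fragment, only for demonstration.
sample = (
    '<p>Mozart <a href="/wiki/File:M.jpg">'
    '<img src="m.jpg" alt="M" /></a> was born in Salzburg.</p>'
)
print(strip_images(sample))
```

This leaves the surrounding <a> and <div class="thumbcaption"> elements in place, so empty links and captions may remain. A regular expression like this can also be confused by a ">" inside an attribute value; if that matters, a real HTML parser such as BeautifulSoup is probably the sturdier choice.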

>>> 3. Some characters are not rendered correctly (e.g., IPA: [ˈvÉ”lfgaÅ‹
>>> amaˈdeus ˈmoËtsart]).
>> You're showing the text as windows-1252, but it is UTF-8.
>
> It seems that the html lacks the meta field that specifies the character
> encoding.  The original page does not, of course.  Is there a parameter that
> causes action=render to include the metadata?  Am I using the wrong action?
> Can I safely assume that all Wikipedia pages use UTF-8?

Yes, MediaWiki always outputs UTF-8.

Roan Kattouw (Catrope)
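
Since the output is always UTF-8 but the action=render fragment carries no encoding declaration, one possible workaround is to decode the bytes yourself and wrap them in a minimal document that states the charset before handing it to the renderer. The following is only a sketch, assuming Python 3. The helper name wrap_fragment, the default title, and the sample bytes are hypothetical, and how the result is loaded into GtkMozembed is not shown.

```python
import html


def wrap_fragment(raw_bytes, title="Wikipedia extract"):
    """Decode a UTF-8 HTML fragment and wrap it in a minimal HTML
    document that declares its encoding (hypothetical helper)."""
    # MediaWiki output is UTF-8, so decode explicitly rather than
    # letting the renderer guess (e.g. windows-1252).
    fragment = raw_bytes.decode("utf-8")
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "<title>%s</title>"
        "</head><body>%s</body></html>"
    ) % (html.escape(title), fragment)


# Sample bytes standing in for what index.php?action=render would return.
raw = "<p>IPA: [ˈvɔlfgaŋ amaˈdeus ˈmoːtsart]</p>".encode("utf-8")
page = wrap_fragment(raw)
print(page)
```

If the wrapped page is written to a file or passed on as bytes, it should be encoded as UTF-8 again so that it matches the declared charset.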
