Jeffrey Barish wrote:
> On Saturday 21 March 2009 16:15:54 Platonides wrote:
>> Jeffrey Barish wrote:
>>> I am writing a PyGTK application.  I would like to be able to download
>>> text only (with formatting) from Wikipedia and display it in my
>>> application.  I think that I am close to a solution, but I have reached
>>> an impasse due to my ignorance of most of the mediawiki API.
>>>
>>> My plan has been to use GtkMozembed in my application to render the page,
>>> so I need to retrieve html.  What is close to working is to use the
>>> index.php API with action=render and title=<search string for the
>>> Wikipedia page>.  The data that I retrieve does display in my browser,
>>> but it has the following undesired characteristics:
>>>
>>>
>>> 2. There are sections at the end that I don't want (Further reading,
>>> External links, Notes, See also, References).
>> Those sections are part of the content. The API doesn't have any
>> parameter to include/exclude them.
>>
>>> 1. All images appear (I want none).
>> Same issue, although this one is easier to work around: remove
>> everything that matches /<img.*?>/
> 
> It seems that images appear in <div class="thumbcaption"></div> blocks.
> Would you advise using regular expressions to remove these blocks, or
> should I use something like BeautifulSoup to parse the page formally and
> then remove elements?
> 
Only thumbnailed images appear in such blocks. You should really just 
remove <img> tags if you want to get rid of images.
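
A minimal sketch of that approach, using only the re module from the
standard library. The names IMG_TAG and strip_images are made up for
this example, and the pattern assumes that no attribute value contains
a literal ">" character:

```python
import re

# Matches one whole <img ...> tag, including the self-closing <img ... /> form.
# [^>]* also spans line breaks, so tags wrapped over several lines are caught.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(html):
    """Return html with every <img> tag removed (hypothetical helper)."""
    return IMG_TAG.sub("", html)
```

This drops only the tags themselves. The thumbnail <div> wrappers and
their captions stay in the page, so removing those as well would
probably be easier with a real HTML parser than with regular expressions.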

>>> 3. Some characters are not rendered correctly (e.g., IPA: [ˈvɔlfgaŋ
>>> amaˈdeus ˈmoːtsart]).
>> You're showing the text as windows-1252, but it is UTF-8.
> 
> It seems that the html lacks the meta field that specifies the character
> encoding.  The original page has it, of course.  Is there a parameter that
> causes action=render to include the metadata?  Am I using the wrong action?
> Can I safely assume that all Wikipedia pages use UTF-8?
Yes, MediaWiki always outputs UTF-8.
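
Since the output is always UTF-8, one option is to decode the fragment
as UTF-8 yourself and wrap it in a small document that declares the
encoding before handing it to the renderer. The following is only a
sketch: render_url and wrap_fragment are hypothetical helper names, and
the English Wikipedia base URL is an assumption about which wiki you
are querying.

```python
from urllib.parse import urlencode


def render_url(title, base="https://en.wikipedia.org/w/index.php"):
    """Build an index.php URL with action=render for the given title.

    The default base URL is an assumption (English Wikipedia).
    """
    return base + "?" + urlencode({"action": "render", "title": title})


def wrap_fragment(body_bytes):
    """Decode an action=render fragment as UTF-8 and wrap it in a minimal
    HTML document whose meta tag declares the encoding."""
    body = body_bytes.decode("utf-8")
    return (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head>'
        "<body>" + body + "</body></html>"
    )
```

The bytes passed to wrap_fragment would be whatever your HTTP client
returns for render_url(...). The wrapped result is a Python string, so
encode it as UTF-8 again if the widget expects bytes.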

Roan Kattouw (Catrope)

_______________________________________________
Mediawiki-api mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
