jayvdb added a subscriber: tommorris.
jayvdb added a comment.

In https://phabricator.wikimedia.org/T78416#938277, @murfel wrote:

> I think to implement it in the following way: catch all page which link to a 
> given template, get HTML for each page, look for table with 
> id="template_name" inside of HTML, parse key-values in the table and add them 
> to Wikibase.
>
> Did I get it right?


Maybe, but maybe not.  My inclusion of {{Persondata}} as an example was perhaps 
misleading.

This harvest_microformats script should not be based on templates, as that is 
the job of harvest_template.py.

This script will use pagegenerators as arguments to select which pages should 
be processed, and -page:"..." is the easiest to use for testing.

For each page, get the HTML as you've said, and look for __microformats__ 
(http://microformats.org/) in the HTML.  Microformats are usually described 
using HTML class="..." attributes, such as:

view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin

<span class="bday">1706-01-17</span>
<span class="dday deathdate">1790-04-17</span>

and

view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal

<th colspan="2" class="fn org" 
style="text-align:center;font-size:125%;font-weight:bold;font-size: larger; 
background-color: #CEDEFF">Manchester Ship Canal</th>
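
To make the two examples above concrete, here is a minimal sketch that 
collects the text of every element carrying one of a given set of class names. 
It uses only the standard library html.parser; the names ClassTextCollector 
and harvest_classes are made up for illustration and are not part of 
pywikibot or any microformats library. It assumes reasonably well-formed HTML.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Hypothetical helper: gather text found inside elements whose
    class attribute contains one of the wanted class names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []                 # (tag, wanted classes on that tag)
        self.found = defaultdict(list)  # class name -> list of text values

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, self.wanted.intersection(classes)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to every wanted class on any enclosing element.
        for _, classes in self.stack:
            for name in classes:
                self.found[name].append(text)


def harvest_classes(html, wanted):
    """Return {class name: [text, ...]} for the wanted class names."""
    parser = ClassTextCollector(wanted)
    parser.feed(html)
    parser.close()
    return dict(parser.found)
```

Called as harvest_classes(html, ["bday", "dday", "fn"]), this returns a dict 
keyed by class name. An element with several classes, like the "dday 
deathdate" span, is reported under each wanted class it carries.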

The two most important standardised microformats are
http://microformats.org/wiki/hCard
http://microformats.org/wiki/hCalendar

Another microformat that is very relevant to wikis is 
http://microformats.org/wiki/rel-license

However, Wikimedia mostly uses its own non-standard microformats. For example, 
"licensetpl" is used by Wikisource and Wikimedia Commons instead of rel-license:

view-source:https://en.wikisource.org/wiki/The_Clipper_Ship_Era

<table class="licensetpl" style="display:none;">
<tr>
<td><span class="licensetpl_short">Public domain</span><span 
class="licensetpl_long">Public domain</span><span 
class="licensetpl_link_req">false</span><span 
class="licensetpl_attr_req">false</span></td>
</tr>
</table>
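
A rough sketch of how the key-values could be pulled out of a "licensetpl" 
table like the one above, again using only the standard library. The class 
name LicenseTplParser and the function parse_licensetpl are hypothetical. The 
sketch assumes each value sits in a flat <span class="licensetpl_..."> with no 
nested spans, and if a page has several such tables the later values win.

```python
from html.parser import HTMLParser


class LicenseTplParser(HTMLParser):
    """Hypothetical helper: map licensetpl_<key> spans to {key: text}."""

    PREFIX = "licensetpl_"

    def __init__(self):
        super().__init__()
        self.current = None  # key of the span being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        for name in (dict(attrs).get("class") or "").split():
            if name.startswith(self.PREFIX):
                self.current = name[len(self.PREFIX):]
                self.fields[self.current] = ""

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data


def parse_licensetpl(html):
    """Return e.g. {"short": ..., "long": ..., "link_req": ..., ...}."""
    parser = LicenseTplParser()
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.fields.items()}
```

The values come back as plain strings ("false", not False); converting them 
to whatever Wikibase expects would be a separate step.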

When microformats have been found in the HTML, yes: "parse key-values [from 
the microformat] and add them to Wikibase". But there are Python libraries 
that already do most of the grunt work for you, so hopefully you don't need to 
do the parsing yourself; e.g. see http://microformats.org/wiki/parsers and 
search https://pypi.python.org/pypi/ . One library mentioned is 
https://github.com/tommorris/mf2py , which is maintained by @tommorris , 
English Wikipedia admin among other things.
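
As a sketch of what using such a library might look like: mf2py.parse(doc=...) 
returns a dict in the microformats2 JSON shape, with an "items" list whose 
entries have "type", "properties" and optionally "children". The two function 
names below are hypothetical, and the mf2py import is kept inside the function 
so that the flattening helper can be tried without the library installed. 
Note that mf2py looks for a root class (e.g. "vcard" or "h-card"), so the bare 
Wikimedia-specific classes such as "licensetpl" would most likely still need 
custom handling.

```python
def parse_microformats(html, url=None):
    """Parse HTML with mf2py (third-party; pip install mf2py)."""
    import mf2py  # imported here so the helper below works without it
    return mf2py.parse(doc=html, url=url)


def iter_properties(parsed):
    """Yield (types, property name, value) from a microformats2 JSON dict.

    Nested microformats are reported by their "value" and then walked
    as items in their own right.
    """
    pending = list(parsed.get("items", []))
    while pending:
        item = pending.pop(0)
        types = item.get("type", [])
        for name, values in item.get("properties", {}).items():
            for value in values:
                if isinstance(value, dict):
                    if "type" in value:
                        pending.append(value)  # embedded microformat
                    value = value.get("value", "")
                yield types, name, value
        pending.extend(item.get("children", []))
```

Each yielded triple is then a candidate for a Wikibase claim, once the 
property name has been mapped to a Wikibase property; that mapping is not 
covered here.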


TASK DETAIL
  https://phabricator.wikimedia.org/T78416

To: murfel, jayvdb
Cc: Aklapper, jayvdb, murfel, tommorris, pywikipedia-bugs


