jayvdb added a subscriber: tommorris.
jayvdb added a comment.
In https://phabricator.wikimedia.org/T78416#938277, @murfel wrote:
> I think to implement it in the following way: catch all page which link to a
> given template, get HTML for each page, look for table with
> id="template_name" inside of HTML, parse key-values in the table and add them
> to Wikibase.
>
> Did I get it right?
Maybe, but maybe not. My inclusion of {{Persondata}} as an example was perhaps
misleading.
This harvest_microformats script should not be based on templates, as that is
the job of harvest_template.py.
This script will use pagegenerators as arguments to select which pages should
be processed, and -page:"..." is the easiest to use for testing.
For each page, get the HTML as you've said, and look for __microformats__
(http://microformats.org/) in the HTML. Microformats are usually described
using HTML class="..." attributes, such as:
view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin
<span class="bday">1706-01-17</span>
<span class="dday deathdate">1790-04-17</span>
and
view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal
<th colspan="2" class="fn org"
style="text-align:center;font-size:125%;font-weight:bold;font-size: larger;
background-color: #CEDEFF">Manchester Ship Canal</th>
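As a rough illustration of what "look for microformats in the HTML" could mean for the two examples above, here is a minimal sketch using only the Python standard library. The class name ClassTextCollector and the function extract_class_values are hypothetical names made up for this sketch; they are not part of pywikibot or of any microformats library, and a real script would more likely rely on an existing parser.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of elements whose class attribute has a wanted name.

    Hypothetical helper for illustration only.
    """

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.found = []   # (class_name, text) pairs, in closing order
        self._open = []   # captures for matched elements not yet closed

    def handle_starttag(self, tag, attrs):
        # Track nesting of same-named tags inside an element being captured.
        for capture in self._open:
            if capture["tag"] == tag:
                capture["depth"] += 1
        classes = (dict(attrs).get("class") or "").split()
        matched = [name for name in classes if name in self.wanted]
        if matched:
            self._open.append(
                {"tag": tag, "depth": 1, "names": matched, "text": []}
            )

    def handle_data(self, data):
        # Text belongs to every matched element that is currently open.
        for capture in self._open:
            capture["text"].append(data)

    def handle_endtag(self, tag):
        still_open = []
        for capture in self._open:
            if capture["tag"] == tag:
                capture["depth"] -= 1
            if capture["depth"] == 0:
                text = "".join(capture["text"]).strip()
                self.found.extend((name, text) for name in capture["names"])
            else:
                still_open.append(capture)
        self._open = still_open


def extract_class_values(html, wanted):
    """Return (class_name, text) pairs for the wanted class names."""
    collector = ClassTextCollector(wanted)
    collector.feed(html)
    collector.close()
    return collector.found


if __name__ == "__main__":
    sample = (
        '<span class="bday">1706-01-17</span>'
        '<span class="dday deathdate">1790-04-17</span>'
    )
    print(extract_class_values(sample, ["bday", "dday", "deathdate"]))
```

An element with several class names, like "fn org", yields one pair per matching name, so the caller can decide which property each name maps to.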
The two most important standardised microformats are
http://microformats.org/wiki/hCard
http://microformats.org/wiki/hCalendar
Another microformat that is very relevant to wikis is
http://microformats.org/wiki/rel-license
However, Wikimedia mostly uses its own non-standard microformats. For example,
"licensetpl" is used by Wikisource and Wikimedia Commons instead of rel-license:
view-source:https://en.wikisource.org/wiki/The_Clipper_Ship_Era
<table class="licensetpl" style="display:none;">
<tr>
<td><span class="licensetpl_short">Public domain</span><span
class="licensetpl_long">Public domain</span><span
class="licensetpl_link_req">false</span><span
class="licensetpl_attr_req">false</span></td>
</tr>
</table>
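To make the "licensetpl" example concrete, the following is a small sketch, again standard library only, that turns the hidden table above into key-values. LicenseTplParser and parse_licensetpl are hypothetical names for this sketch, and it assumes the licensetpl_* spans are not nested inside each other, as in the markup shown.

```python
from html.parser import HTMLParser


class LicenseTplParser(HTMLParser):
    """Map licensetpl_* class names to the text of their elements.

    Hypothetical helper; assumes the licensetpl_* elements are not nested.
    """

    PREFIX = "licensetpl_"

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        for name in (dict(attrs).get("class") or "").split():
            if name.startswith(self.PREFIX):
                # "licensetpl_short" becomes the key "short".
                self._current = name[len(self.PREFIX):]
                self.fields[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Any closing tag ends the field being read.
        self._current = None


def parse_licensetpl(html):
    """Return a dict of licensetpl fields found in the given HTML."""
    parser = LicenseTplParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

The outer table has the class "licensetpl" without the underscore suffix, so it is skipped and only the inner spans become fields.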
Once microformats have been found in the HTML, then yes: "parse key-values [from
the microformat] and add them to Wikibase". However, there are Python libraries
that already do most of the grunt work for you, so hopefully you don't need to
do the parsing yourself; see http://microformats.org/wiki/parsers and
search https://pypi.python.org/pypi/ . One library mentioned is
https://github.com/tommorris/mf2py , which is maintained by @tommorris ,
an English Wikipedia admin among other things.
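For a sense of what leaning on such a library might look like, here is a minimal sketch that asks mf2py for the rel-license links in a piece of HTML. It assumes mf2py is installed and that mf2py.parse(doc=...) returns a dict with a "rels" entry, which is how the library is documented; the function name license_urls is made up for this sketch. Note that mf2py targets standard microformats, so Wikimedia-specific ones such as "licensetpl" would presumably still need custom handling.

```python
try:
    import mf2py  # third-party library: pip install mf2py
except ImportError:
    # Keep the sketch importable when the library is not installed.
    mf2py = None


def license_urls(html):
    """Return the rel-license URLs in html, or None if mf2py is missing.

    Hypothetical helper for illustration only.
    """
    if mf2py is None:
        return None
    parsed = mf2py.parse(doc=html)
    # "rels" maps each rel value to the list of URLs that carry it.
    return parsed.get("rels", {}).get("license", [])


if __name__ == "__main__":
    sample = (
        '<a rel="license" '
        'href="https://creativecommons.org/licenses/by-sa/3.0/">'
        'CC BY-SA 3.0</a>'
    )
    print(license_urls(sample))
```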
TASK DETAIL
https://phabricator.wikimedia.org/T78416
To: murfel, jayvdb
Cc: Aklapper, jayvdb, murfel, tommorris, pywikipedia-bugs