*Mentor: Dimitris Kontokostas*
*Student: Aditya Nambiar*

GSoC Project: Automatic Mappings Extraction & Upgrade Sweble Parser
Link: https://summerofcode.withgoogle.com/dashboard/project/

TL;DR: The task of this project was to create extractors that identify
Wikidata annotations in Wikipedia articles and Wikipedia templates and
transform them into DBpedia mappings. We not only completed this task but
also upgraded the DBpedia parser to enable the parsing of more complex /
nested templates.

Click here
to view this report nicely formatted.

Here’s the long version:

DBpedia currently maintains mappings from Wikipedia infobox properties
to the DBpedia ontology, since several similar templates exist to describe
the same type of infobox. The aim of the project is to enrich the
existing mappings and, where possible, correct erroneous mappings using
Wikidata.

*Extracting Article Wikidata annotations*

Wikipedia provides parser functions that can fetch values from Wikidata and
display them directly in a Wikipedia article [ link] . For example, in an
article we can find the following:

{{ Infobox Test1
| area_total_km2         = 54.84
| population_as_of       = {{#invoke:Wikidata|getQualifie
| population_note        =
| population_total       = {{#property:P1082}}

We extract this information and generate:
1. ("Infobox Test1","population_as_of","P1082/P585")
2. ("Infobox Test1","population_total","P1082")
At the end, we evaluate all the generated triples.
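The actual extractor is part of the DBpedia extraction framework (written in Scala); the following is only a minimal Python sketch of the idea, using a hypothetical function name and a simple regex for the {{#property:...}} form:

```python
import re

# Hypothetical sketch: the real extractor lives in the DBpedia extraction
# framework (Scala). The regex and function name here are illustrative only.
PROPERTY_RE = re.compile(r"\{\{#property:(P\d+)\}\}")

def extract_article_annotations(template_name, params):
    """Map each infobox parameter to the Wikidata property it fetches."""
    triples = []
    for name, value in params.items():
        match = PROPERTY_RE.search(value)
        if match:
            triples.append((template_name, name, match.group(1)))
    return triples

params = {
    "population_note": "",
    "population_total": "{{#property:P1082}}",
}
print(extract_article_annotations("Infobox Test1", params))
# [('Infobox Test1', 'population_total', 'P1082')]
```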

Link to the extractor - link

*Extracting Template Wikidata annotations*

Sometimes the Wikidata annotations are embedded directly in a Wikipedia
template. In those cases we assume that the mapping is direct. For example,
inside the page “Infobox Test1” (the infobox definition) we can find the
following:

| data37 = {{#if:{{{website|}}}
                 |{{#ifeq:{{{website|}}}|hide||{{{website|}}} }}
| established_date  = {{#if: {{{established_date|}}} |
{{{established_date}}} | {{#invoke:Wikidata|property|P765}} }}

We extract this information and generate:
1. ("Infobox Test1","website","P856")
2. ("Infobox Test1","hide","P856")
3. ("Infobox Test1","established_date","P765")
4. ("Infobox Test1","URL","P856")

Annotations in templates are considered more credible and can be applied
directly, while annotations in articles need some extra post-processing to
identify possible outliers (left as follow-up work).
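The established_date case above, where a parameter’s default value falls back to a Wikidata call, can be sketched as follows (a hypothetical Python illustration, not the framework’s actual Scala code; the regex is an assumption):

```python
import re

# Hypothetical sketch: detect a template parameter whose default value
# falls back to a Wikidata call, as in
#   {{#if: {{{established_date|}}} | {{{established_date}}}
#         | {{#invoke:Wikidata|property|P765}} }}
FALLBACK_RE = re.compile(
    r"\{\{\{(\w+)\|?\}\}\}.*?\{\{#invoke:Wikidata\|property\|(P\d+)",
    re.DOTALL,
)

def extract_template_annotations(template_name, wikitext):
    """Pair each parameter with the property used as its fallback value."""
    return [(template_name, param, prop)
            for param, prop in FALLBACK_RE.findall(wikitext)]

wikitext = ("| established_date = {{#if: {{{established_date|}}} | "
            "{{{established_date}}} | {{#invoke:Wikidata|property|P765}} }}")
print(extract_template_annotations("Infobox Test1", wikitext))
# [('Infobox Test1', 'established_date', 'P765')]
```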

Link to the extractor - link

*Advanced WikiText parsing*

To perform the above extractions, we made extensive use of the AST
generated by the simple parser. However, in several cases the simple parser
failed to create a correct AST, especially when the input contains nested
template parameters, e.g.:
  try2 = {{#if: abc |{{#ifeq:{{{website|}}}|hide||{{{website|}}} }} | pqrs

The simple parser would create text nodes with text = "}|hide||", which
makes no sense. It also failed at parsing ParserFunctionNodes, which was
important for the first phase of the project. The Sweble parser solves
these problems.

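To see why nested template parameters trip up naive text handling: argument boundaries ('|') must only be recognized at brace-nesting depth zero. A minimal illustrative sketch of that idea (this is not the Sweble parser’s implementation):

```python
# Illustrative sketch (not the Sweble parser): split a parser function's
# arguments on '|' only at brace-nesting depth zero. A naive text split
# breaks on nested templates such as {{{website|}}}.
def split_top_level(wikitext):
    parts, current, depth, i = [], [], 0, 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair in ("{{", "}}"):
            # Track template nesting; '{{{param|}}}' counts as '{{' + '{'.
            depth += 1 if pair == "{{" else -1
            current.append(pair)
            i += 2
        elif wikitext[i] == "|" and depth == 0:
            # Only a top-level '|' separates arguments.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(wikitext[i])
            i += 1
    parts.append("".join(current))
    return parts

args = "abc |{{#ifeq:{{{website|}}}|hide||{{{website|}}} }} | pqrs"
print(split_top_level(args))
```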
*Upgrade Sweble Parser*

As mentioned above, to deal with the cases where the simple parser fails,
we decided to upgrade the existing Sweble parser to v2.1.

*Work Done*

We successfully upgraded the parser and added support for several node
types that the earlier Sweble wrapper did not handle, such as XmlElements,
ImageLinks, etc.
We then created parameterized unit tests to help developers see where the
two parsers create similar ASTs and where they differ, by overriding the
“equals” method in each of the subclasses of the Node class.
Parameterized unit tests also make it very easy to add new test cases.
We also tested the two parsers across several diverse Wikipedia pages,
ranging from abstract topics like Renaissance to books, famous people like
Adolf Hitler, monuments, etc.
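The structural-equality idea behind those tests can be illustrated like this (a hypothetical, simplified Python sketch; the real framework overrides equals in each Scala Node subclass):

```python
# Hypothetical sketch: structural equality for AST nodes, mirroring the
# overridden "equals" methods in the framework's Node subclasses.
class Node:
    def __init__(self, kind, text="", children=None):
        self.kind = kind
        self.text = text
        self.children = children or []

    def __eq__(self, other):
        # Two nodes are equal iff their kind, text, and (recursively)
        # their children all match.
        return (isinstance(other, Node)
                and self.kind == other.kind
                and self.text == other.text
                and self.children == other.children)

# Two parsers agree on an input if their ASTs compare equal node by node.
simple_ast = Node("Template", "Infobox Test1",
                  [Node("Property", "population_total")])
sweble_ast = Node("Template", "Infobox Test1",
                  [Node("Property", "population_total")])
print(simple_ast == sweble_ast)  # True
```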

**Project code / commits**
All Commits to Master branch - link
Pull Request   <https://github.com/dbpedia/extraction-framework/pull/472>
