*Mentor: Dimitris Kontokostas*
*Student: Aditya Nambiar*

GSoC Project: Automatic Mappings Extraction & Upgrade Sweble Parser
Link: https://summerofcode.withgoogle.com/dashboard/project/

TL;DR: The task of this project was to create extractors that identify
Wikidata annotations in Wikipedia articles and Wikipedia templates and
transform them into DBpedia mappings. We not only completed this task but
also upgraded the DBpedia parser to enable the parsing of more complex /
nested templates.

Click here
to view this report nicely formatted.

Here’s the long version:

DBpedia currently maintains mappings from Wikipedia infobox properties
to the DBpedia ontology, since several similar templates exist to describe
the same type of infobox. The aim of the project is to enrich the
existing mappings and, where possible, correct erroneous mappings using
Wikidata.

*Extracting Article Wikidata annotations*

Wikipedia provides parser functions that can fetch values from Wikidata and
display them directly in a Wikipedia article [ link] . For example, in an
article we can find the following:

{{ Infobox Test1
| area_total_km2         = 54.84
| population_as_of       = {{#invoke:Wikidata|getQualifie
| population_note        =
| population_total       = {{#property:P1082}}

We extract this information and generate:
1. ("Infobox Test1","population_as_of","P1082/P585")
2. ("Infobox Test1","population_total","P1082")
At the end, we evaluate all the generated triples.
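The actual extractor is part of the DBpedia extraction framework (written in Scala); the following is only a minimal Python sketch of the idea, using a hypothetical function name and a simple regex for the {{#property:...}} form:

```python
import re

# Hypothetical sketch: the real extractor lives in the DBpedia extraction
# framework (Scala). The regex and function name here are illustrative only.
PROPERTY_RE = re.compile(r"\{\{#property:(P\d+)\}\}")

def extract_article_annotations(template_name, params):
    """Map each infobox parameter to the Wikidata property it fetches."""
    triples = []
    for name, value in params.items():
        match = PROPERTY_RE.search(value)
        if match:
            triples.append((template_name, name, match.group(1)))
    return triples

params = {
    "population_note": "",
    "population_total": "{{#property:P1082}}",
}
print(extract_article_annotations("Infobox Test1", params))
# [('Infobox Test1', 'population_total', 'P1082')]
```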

Link to the extractor - link

*Extracting Template Wikidata annotations*

Sometimes the Wikidata annotations are embedded directly in a Wikipedia
template. In those cases we assume that the mapping is direct. For example,
inside the page “Infobox Test1” (the infobox definition) we can find the
following:

| data37 = {{#if:{{{website|}}}
                 |{{#ifeq:{{{website|}}}|hide||{{{website|}}} }}
| established_date  = {{#if: {{{established_date|}}} |
{{{established_date}}} | {{#invoke:Wikidata|property|P765}} }}

We extract this information and generate:
1. ("Infobox Test1","website","P856")
2. ("Infobox Test1","hide","P856")
3. ("Infobox Test1","established_date","P765")
4. ("Infobox Test1","URL","P856")

Annotations in templates are considered more credible and can be applied
directly, while annotations in articles need some extra post-processing to
identify possible outliers (left as follow-up work).
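The established_date case above, where a parameter’s default value falls back to a Wikidata call, can be sketched as follows (a hypothetical Python illustration, not the framework’s actual Scala code; the regex is an assumption):

```python
import re

# Hypothetical sketch: detect a template parameter whose default value
# falls back to a Wikidata call, as in
#   {{#if: {{{established_date|}}} | {{{established_date}}}
#         | {{#invoke:Wikidata|property|P765}} }}
FALLBACK_RE = re.compile(
    r"\{\{\{(\w+)\|?\}\}\}.*?\{\{#invoke:Wikidata\|property\|(P\d+)",
    re.DOTALL,
)

def extract_template_annotations(template_name, wikitext):
    """Pair each parameter with the property used as its fallback value."""
    return [(template_name, param, prop)
            for param, prop in FALLBACK_RE.findall(wikitext)]

wikitext = ("| established_date = {{#if: {{{established_date|}}} | "
            "{{{established_date}}} | {{#invoke:Wikidata|property|P765}} }}")
print(extract_template_annotations("Infobox Test1", wikitext))
# [('Infobox Test1', 'established_date', 'P765')]
```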

Link to the extractor - link

*Advanced WikiText parsing*

To perform the above extractions, we made extensive use of the AST
generated by the simple parser. However, in several cases the simple parser
failed to create a correct AST, especially when the input contains nested
template parameters, e.g.:
  try2 = {{#if: abc |{{#ifeq:{{{website|}}}|hide||{{{website|}}} }} | pqrs

The simple parser would create text nodes with text = "}|hide||", which
makes no sense. It also failed at parsing ParserFunctionNodes, which was
important for the first phase of the project. The Sweble parser solves
these problems.

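To see why nested template parameters trip up naive text handling: argument boundaries ('|') must only be recognized at brace-nesting depth zero. A minimal illustrative sketch of that idea (this is not the Sweble parser’s implementation):

```python
# Illustrative sketch (not the Sweble parser): split a parser function's
# arguments on '|' only at brace-nesting depth zero. A naive text split
# breaks on nested templates such as {{{website|}}}.
def split_top_level(wikitext):
    parts, current, depth, i = [], [], 0, 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair in ("{{", "}}"):
            # Track template nesting; '{{{param|}}}' counts as '{{' + '{'.
            depth += 1 if pair == "{{" else -1
            current.append(pair)
            i += 2
        elif wikitext[i] == "|" and depth == 0:
            # Only a top-level '|' separates arguments.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(wikitext[i])
            i += 1
    parts.append("".join(current))
    return parts

args = "abc |{{#ifeq:{{{website|}}}|hide||{{{website|}}} }} | pqrs"
print(split_top_level(args))
```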
*Upgrade Sweble Parser*

As mentioned above, to deal with the cases where the simple parser fails,
we decided to upgrade the existing Sweble parser to v2.1.

*Work Done*

We successfully upgraded the parser and added support for several node
types that the earlier Sweble wrapper did not handle, such as XmlElements,
ImageLinks, etc.
We then created parameterized unit tests to help developers see where the
two parsers create similar ASTs and where they differ, by overriding the
“equals” method in each of the subclasses of the Node class.
Parameterized unit tests also make it very easy to add new test cases.
We also tested the two parsers across several diverse Wikipedia pages,
ranging from abstract topics like Renaissance to books, famous people like
Adolf Hitler, monuments, etc.
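The structural-equality idea behind those tests can be illustrated like this (a hypothetical, simplified Python sketch; the real framework overrides equals in each Scala Node subclass):

```python
# Hypothetical sketch: structural equality for AST nodes, mirroring the
# overridden "equals" methods in the framework's Node subclasses.
class Node:
    def __init__(self, kind, text="", children=None):
        self.kind = kind
        self.text = text
        self.children = children or []

    def __eq__(self, other):
        # Two nodes are equal iff their kind, text, and (recursively)
        # their children all match.
        return (isinstance(other, Node)
                and self.kind == other.kind
                and self.text == other.text
                and self.children == other.children)

# Two parsers agree on an input if their ASTs compare equal node by node.
simple_ast = Node("Template", "Infobox Test1",
                  [Node("Property", "population_total")])
sweble_ast = Node("Template", "Infobox Test1",
                  [Node("Property", "population_total")])
print(simple_ast == sweble_ast)  # True
```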

**Project code / commits**
All Commits to Master branch - link
Pull Request   <https://github.com/dbpedia/extraction-framework/pull/472>
