[DBpedia-discussion] Integrating RML in the DBpedia extraction framework

Wouter Maroy Tue, 23 Aug 2016 01:28:18 -0700

Student: Wouter Maroy

Mentors: Anastasia Dimou, Dimitris Kontokostas


TL;DR: The goal of this GSoC project was to start the integration of RML
(http://rml.io) with the DBpedia mappings wiki. The project had two main
goals and one optional goal, all of which were completed successfully.

To read this in a nicely formatted way click here:
https://docs.google.com/document/d/1BwSG6Rg-tPZlaATIGvsnLkOSnU7wmRg2Gy57dSGhBrU/edit#

Introduction

DBpedia uses its own custom mappings for extracting triples from Wikipedia.
The goal of this project was to integrate RML, a general mapping language
for generating triples, and to replace the original mappings with RML
mapping documents. The project had two main goals and one optional goal.

Main goals:

- Translate the DBpedia defined mappings to RML mapping documents

- Import RML documents into the extraction framework and convert them to
the existing DBpedia mapping data structures

Optional goal:

- Create a prototype of an integrated RML processor in the DBpedia
extraction framework

The project was a success: all goals, including the optional one, were
completed and produced good results.

First goal: translating the DBpedia mappings to RML mappings

DBpedia uses different types of custom mappings (e.g. simple property
mappings, date interval mappings) for extracting triples from Wikipedia
infoboxes. These are in general quite complex, so creating one-to-one
translations from DBpedia mappings to RML mappings was no easy task, and
designing them took a considerable part of the project. We wanted the
translations to be very accurate, because the better they are, the better
the results at the end of the process.

To create the alignment it was necessary to dive into the exact details of
how the DBpedia mappings are used in the extraction framework. Conversely,
it was necessary to fully understand how an RML mapping could produce the
same results.

All the DBpedia mappings eventually got an RML counterpart. Some mappings
were straightforward, but most cases were very specific and needed a custom
solution. The next step was to automate the translation from the original
DBpedia mapping files that are stored on GitHub to their corresponding RML
version. This was implemented in the server module of the extraction
framework. Through this functionality it is now possible to access the RML
version of every DBpedia mapping that is present on the running server.
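
To make the idea of an automated translation concrete, here is a minimal
sketch in Python. It is not the code from the extraction framework (which is
written in Scala and handles many more mapping types). It assumes the
simplest case, a single PropertyMapping template from the mappings wiki, and
renders it as a predicate-object map using the R2RML/RML terms
rr:predicateObjectMap, rr:predicate, rr:objectMap and rml:reference. The
function names, the sample input and the exact output shape are assumptions
made for illustration only.

```python
import re

# Illustrative input: one simple property mapping in mappings-wiki syntax.
DBPEDIA_MAPPING = (
    "{{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}"
)

# Assumed output shape; the mappings produced in the project may differ.
RML_TEMPLATE = (
    "rr:predicateObjectMap [\n"
    "    rr:predicate {predicate} ;\n"
    '    rr:objectMap [ rml:reference "{reference}" ]\n'
    "]"
)


def parse_property_mapping(wiki_text):
    """Extract the key/value fields of a single {{PropertyMapping ...}} template."""
    match = re.fullmatch(
        r"\{\{\s*PropertyMapping\s*\|(.*)\}\}", wiki_text.strip(), re.DOTALL
    )
    if match is None:
        raise ValueError("not a PropertyMapping template")
    fields = {}
    for part in match.group(1).split("|"):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def to_rml_predicate_object_map(fields):
    """Render the parsed fields as an RML predicate-object map (Turtle fragment)."""
    return RML_TEMPLATE.format(
        predicate=fields["ontologyProperty"],
        reference=fields["templateProperty"],
    )


fields = parse_property_mapping(DBPEDIA_MAPPING)
rml_fragment = to_rml_predicate_object_map(fields)
print(rml_fragment)
```

The more specific mapping types mentioned above, such as date interval
mappings, would each need their own translation rule; a single template
string like this one only covers the straightforward case.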

Second goal: importing and converting RML

A first step towards executing RML mapping documents in the framework is
adding a parser that understands RML documents and converts them into a
structure the extraction framework understands. To be specific, the
extraction framework uses mapping data structures to store its loaded
mappings. The parser loads the RML mapping documents and converts them to
these mapping data structures.

The advantage of this method is that RML documents can be run and generate
triples just as if the framework were using the old mapping documents. No
big changes are needed in the extraction framework itself to make this
work. The drawback is that not all functionality of RML is available: only
the specific RML mappings designed for each DBpedia mapping can be
understood and executed by this parser. For all functionality to be
available, an RML processor needs to be fully integrated.
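
The trade-off described above can be sketched in a few lines of Python. This
is an illustration under stated assumptions, not the parser that was added
to the framework: the parsed RML is represented as plain dictionaries, and
SimplePropertyMapping, UnsupportedRmlConstruct and
convert_predicate_object_map are hypothetical stand-ins for the framework's
own data structures and conversion logic. The point is that a converter
which recognises fixed shapes accepts exactly those shapes and has to reject
everything else, even when it is valid RML.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimplePropertyMapping:
    """Hypothetical stand-in for one of the framework's mapping data structures."""

    template_property: str
    ontology_property: str


class UnsupportedRmlConstruct(Exception):
    """Raised for RML that does not match one of the known translated shapes."""


def convert_predicate_object_map(pom):
    """Convert one predicate-object map (a plain dict here) to a mapping object."""
    object_map = pom.get("objectMap", {})
    # Only the shape produced by the DBpedia-to-RML translation is recognised.
    if "predicate" in pom and set(object_map) == {"reference"}:
        return SimplePropertyMapping(object_map["reference"], pom["predicate"])
    raise UnsupportedRmlConstruct(f"cannot convert: {pom!r}")


supported = {"predicate": "foaf:name", "objectMap": {"reference": "name"}}
print(convert_predicate_object_map(supported))

# rr:template is a regular R2RML/RML term, but this converter has no rule for it.
unsupported = {
    "predicate": "foaf:page",
    "objectMap": {"template": "http://example.org/{name}"},
}
try:
    convert_predicate_object_map(unsupported)
except UnsupportedRmlConstruct as error:
    print("rejected:", error)
```

A fully integrated RML processor would not need such a whitelist of shapes,
which is why it is required for the complete feature set.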

An implementation of this parser was added to the extraction framework. It
can read all the custom-designed mappings that were created, and the
framework can load and run them. The results are very good: the generated
triples are the same as when the process is run with the original DBpedia
mappings.
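
One simple way to check such an equivalence is to compare the two outputs as
sets of triples. The Python sketch below is only an illustration of that
idea and not the procedure used in the project: it assumes both runs are
serialized as N-Triples without blank nodes, so that equal triples are equal
lines, and the two sample documents are made up for the example.

```python
def load_triples(ntriples_text):
    """Return the set of non-empty, non-comment lines of an N-Triples document."""
    return {
        line.strip()
        for line in ntriples_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }


def diff_triples(expected, actual):
    """Return (missing, unexpected) triples when comparing two extraction runs."""
    return expected - actual, actual - expected


# Made-up sample output; the order of the lines differs on purpose.
ORIGINAL_RUN = """
<http://example.org/A> <http://xmlns.com/foaf/0.1/name> "A" .
<http://example.org/B> <http://xmlns.com/foaf/0.1/name> "B" .
"""
RML_RUN = """
<http://example.org/B> <http://xmlns.com/foaf/0.1/name> "B" .
<http://example.org/A> <http://xmlns.com/foaf/0.1/name> "A" .
"""

missing, unexpected = diff_triples(load_triples(ORIGINAL_RUN), load_triples(RML_RUN))
print("missing:", len(missing), "unexpected:", len(unexpected))
```

Because sets ignore ordering and duplicates, the comparison only reports
triples that one run produced and the other did not.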

Optional goal: prototyping an integrated RML processor

To make all functionality of RML available, a real RML processor is
needed. An integrated RML processor would also make it possible to test the
mapping documents that were designed during the first part of the project.
Within the scope of this project, an optional goal was to create a prototype
to give an idea of what is possible.

There were some discussions on how this could be implemented, and a
solution was picked. A prototype was implemented and produced positive
results. The generated triples were not all complete, but it served its
purpose as a proof of concept: it showed that this workflow for integrating
the processor is a viable solution if fully implemented. It was not certain
beforehand whether there would be time to create this prototype within the
project, since that depended on how long it would take to finalize the main
goals. Luckily everything went as planned and the optional goal was
completed successfully.

Links

Commits:
https://github.com/dbpedia/extraction-framework/commits?author=wmaroy
https://github.com/wmaroy/extraction-framework/commits?author=wmaroy (unmerged)

GSoC project:
https://summerofcode.withgoogle.com/projects/#6213126861094912
