*PhD position: Knowledge extraction from semi-structured documents –
enrichment of DBpedia in French*
*Context*
We are seeking a candidate for a PhD position in the context of a
collaboration between the MELODI group
(http://www.irit.fr/-Equipe-MELODI-) of the Research Institute in
Informatics of Toulouse (IRIT, CNRS UMR 5505) and the CLLE-ERSS team
(http://w3.erss.univ-tlse2.fr/) of the Cognition, Languages, Ergonomics
laboratory (CLLE, UMR 5263 CNRS).
These laboratories are among the strongest research centres in France in
Informatics and Linguistics, respectively. The teams have been
collaborating for 20 years and are recognized experts in natural
language processing, linguistic analysis of corpora, and knowledge
engineering. One of their research areas concerns the linguistic
characterization of semantic relations in corpora and the
operationalization of these characterizations in order to facilitate the
construction of knowledge models. Methods have been developed for
analyzing both written text, using lexico-syntactic patterns
(Aussenac-Gilles and Jacques, 2008) or distributional analysis (Fabre et
al., 2014), and text structure (Kamel et al., 2014).
Methods have also been proposed for integrating different fragments of
knowledge within a single model by means of ontology alignments (Euzenat
et al., 2013). This thesis aims to adapt and combine these methods and
to propose novel ones, with a special focus on enriching the Web of
data. The candidate will be co-supervised by Cécile Fabre, Professor of
Linguistics at the University of Toulouse 2, and Mouna Kamel, Assistant
Professor at IRIT. The thesis will be funded by a « Communauté
d’Universités et d’Établissements Toulouse – Région Midi-Pyrénées »
(COMUE-Région) project.
*Object*
This thesis addresses the problem of building semantic resources from
semi-structured text. The attributes of the text layout, which organise
the text and contribute significantly to its semantics, are
underexploited by most classical NLP methods. The first aim of this
thesis is to study the interaction between visual structure and
discourse analysis, and thus to specify how the analysis of natural
language and the analysis of text structure can be combined. The second
aim is to evaluate the contribution of linguistic information to
automated processes for the construction of semantic resources, for the
identification of semantic relations, and for their integration into a
knowledge model.
The theoretical results will help develop different knowledge
extractors (in particular, semantic relation extractors) for
semi-structured texts in French, in order to enrich a knowledge base.
Each extractor will apply one particular technique (whether or not
inspired by the methods developed by the teams) and will exploit the
different properties (content and structure) of these texts. The
experimental scenario will concern the enrichment of the French DBpedia
resource (http://fr.dbpedia.org/) by extracting knowledge from Wikipedia
pages in French. These pages are semi-structured and rich in knowledge
expressing concepts (domain-specific or general), relations, and rules
associating them and giving them meaning. However, as with the English
DBpedia, this resource is currently constructed from very specific
structured data (infobox, categories, links, etc.) from Wikipedia pages,
*Profile*
We are looking for a candidate with an MSc in Computer
Engineering/Science or an adjacent field. The candidate must have taken
courses in natural language processing. She/he is required to have an
interest in both linguistic aspects (corpus analysis, study and
description of linguistic phenomena, etc.) and statistical aspects,
which will allow her/him to develop learning-based approaches and
distributional analysis techniques. Interest in the Semantic Web in
general, and ontologies in particular, would also be appreciated. The
student must be fluent in French and have a very good command of
English.
We are currently offering a 3-year fully-funded studentship
<http://kmi.open.ac.uk/studentships/vacancies/> commencing in Autumn
2015, thanks to funding from the Toulouse COMUE and the Midi-Pyrénées
Region. The income will be about 20 000 euros per year.
*Contact*
If you are interested in the above, please contact:
Cécile Fabre: [email protected]
Mouna Kamel: [email protected]
*References*
(Aussenac-Gilles and Jacques, 2008) Aussenac-Gilles, N., Jacques, M.-P.:
Designing and Evaluating Patterns for Relation Acquisition from Texts
with Caméléon. Terminology 14(1), 145–73 (2008).
(Euzenat et al., 2013) Euzenat, J., Rosoiu, M., Trojahn dos Santos, C.:
Ontology matching benchmarks: Generation, stability, and
discriminability. Journal of Web Semantics 21, 30-48 (2013).
(Fabre et al., 2014) Fabre, C., Hathout, N., Ho-Dac, L. M.,
Morlane-Hondère, F., Muller, P., Sajous, F., Tanguy, L., Van de Cruys,
T.: Présentation de l'atelier SemDis 2014: sémantique distributionnelle
pour la substitution lexicale et l'exploration de corpus spécialisés.
Actes de l'atelier SemDis 2014, 21e Conférence sur le Traitement
Automatique des Langues Naturelles (TALN 2014), pp. 196-205 (2014).
(Kamel et al., 2014) Kamel, M., Rothenburger, B., Fauconnier, J.-P.:
Identification de relations sémantiques portées par les structures
énumératives paradigmatiques : une approche symbolique et une approche
par apprentissage supervisé. Revue d'Intelligence Artificielle, Hermès
Science, Numéro spécial Ingénierie des Connaissances. Nouvelles
évolutions, Vol. 28, No. 2-3, pp. 271-296 (2014).