PhD in NLP - PATRIMALP Materials, Pigments, Lights: the colors of
Heritage – Natural Language Processing for cultural heritage
Starting date: October 01, 2022 (flexible)
Application deadline: September 5th, 2022
Interviews (tentative): September 12th, 2022
Salary: 1 975 € gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
Keywords: natural language processing, knowledge representation,
cultural heritage, transfer learning, multilingualism
CONTEXT
The main challenge of the Patrimalp project is the development of an
integrated and interdisciplinary Heritage Science, in order to ensure
cultural Heritage sustainability, promotion and dissemination in
contemporary society. The ambition is to make intelligible a global
and evolving process that starts with the collection of the raw
material and its transformation into a primitive object, continues
through its different lives as a material (alterations, degradations,
transformations...) and as a symbol (relegation, disinterest,
oblivion or rebirth, exaltation...) throughout history, and ends with
its election as an object of historical and Heritage value and
its “promotion” into a work of art. This research is applied to
understanding how inks and pigments have been conceived over several
centuries, how they have been used in artworks, and how
handcrafting methods have evolved and been disseminated across
centuries and countries.
To make this study possible, the project will gather a large
collection of textual material made up of alchemical works and
collections of natural or artificial objects collected between the
16th and 18th centuries. To better understand the choice of colors for
these "wonders", we want to reconstruct the recipes for making colored
material in their context of thought, whether technical or symbolic.
These recipes will constitute a new body of research material for
literary scholars and a new case study for building knowledge about
color.
This corpus indeed offers modes of representation inscribed in complex
forms of writing and fiction whose modalities and frames of reference
remain to be analyzed (accounts of technical, medical or
physico-chemical experiments inscribed in fictional worlds or
mythological, symbolic descriptions of artifacts, or materials
collected in nature, mines). On the linguistic level, the inventory of
this lexicon in different European and Eastern languages will lead to
the formalization of the knowledge of these various skills across time
and cultures. This corpus will thus provide complex data on
the material and symbolic origin of the ingredients of color, on its
use, its names and its physical or symbolic perception. These data
represent a challenge for computer scientists, who will have to
organize them, for the benefit of curators, chemists or physicists, in
ontologies representing the state of knowledge from the point of view
of scholars over the ages.
To systematically explore the corpus of these recipes, we will use NLP
techniques to uncover the correlations between recipes, physical and
chemical composition of objects and symbolic references. The final
objective is to build a knowledge base (objects, components of
objects, materials, colors, know-how, reference framework), each
part of which can reference a specific ontology (ontology of
pigments, materials, colors...), so that researchers can
observe, from this specific corpus, the trajectory from the writing of
color to its technical and artisanal practice.
PHD OBJECTIVES
The PhD project will focus on segmenting, extracting and representing
recipes from a corpus of alchemical works from the 16th to 18th
centuries to make them accessible to researchers in the humanities.
This requires the PhD candidate to:
· identify which excerpts of the text belong to a recipe;
· supervise an annotation campaign to build an analysis and
training corpus;
· build NLP tools to automatically extract the list of elements
(raw material, tools, quantity, units) and actions (verb, adverb,
adjective) that make up the recipes;
· analyze the dependencies (rules) between the elements of a recipe;
· represent these rules in a formal knowledge representation.
The results of this processing will support:
· the documentation of this unique set of texts, by adding the
extracted elements to the document metadata to ease retrieval;
· the building of a knowledge base of alchemical recipes.
This PhD will need to address several challenges. One of them is to be
able to process text composed of multiple non-modern languages
(French, German, English, Latin, Greek) [Coavoux2022, Grobol2022]. One
approach will be to study how large multilingual pre-trained models
[Delvin2019, Xue2020] can be leveraged and adapted for the task and
how disparate collections of ancient texts can be used to
fine-tune them. Another challenge will be the paucity of data for the
downstream tasks (segmentation, parsing, Natural Language
Understanding [Desot2022]). For this, we will need to identify other
related corpora (e.g. cooking) and address the problem in a multitask
setting (such as NER and NLU) and with transfer learning.
SKILLS
· Master 2 in Natural Language Processing, computer science or data
science.
· Good command of Python programming and deep learning frameworks.
· Previous experience in text classification, parsing, processing
of several languages or text retrieval would be a plus.
· Very good communication skills in English and good command of French.
SCIENTIFIC ENVIRONMENT
The thesis will be conducted within the STEAMER and GETALP teams of
the LIG laboratory
(http://steamer.imag.fr/ and https://lig-getalp.imag.fr/). The GETALP
team has strong expertise and a track record in Natural Language
Processing; the STEAMER team has strong expertise in knowledge
representation and reasoning. The recruited person will be welcomed
within the teams, which offer a stimulating, multinational and pleasant
working environment. The means to carry out the PhD will be provided
both in terms of work-related travel in France and abroad and in terms
of equipment (personal computer, access to the LIG GPU servers).
The PhD candidate will collaborate with the partners involved in the
PATRIMALP project, in particular with Laurence Rivière from the LUHCIE
lab (Laboratoire Universitaire Histoire Cultures Italie Europe) and
Véronique Adam from the LITT&ARTS lab (Littératures et Arts).
INSTRUCTIONS FOR APPLYING
Applications must contain a CV, a letter/message of motivation and
Master's transcripts (grades); applicants should also be ready to
provide letter(s) of recommendation. Applications must be
addressed to Danielle Ziebelin
([email protected]), François Portet
([email protected]) and Maximin Coavoux
([email protected]).
REFERENCES
[Coavoux2022] Maximin Coavoux, Corinne Denoyelle, Olivier Kraif,
Julie Sorba. Phraséologie du roman médiéval en prose. Diachro X – le
français en diachronie, Sorbonne Université, May 2022, Paris, France
[Delvin2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers
for language understanding. In Proceedings of NAACL.
[Desot2022] Desot, T., Portet, F., & Vacher, M. (2022). End-to-End
Spoken Language Understanding: Performance analyses of a voice command
task in a low resource setting. Computer Speech & Language, 75, 101369.
[Grobol2022] Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez,
Benoît Sagot, Laurent Romary and Benoit Crabbé. BERTrade: Using
Contextual Embeddings to Parse Old French. 13th International
Conference on Language Resources and Evaluation (LREC 2022), May 2022,
Marseille, France
[Xue2020] Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R.,
Siddhant, A., ... & Raffel, C. (2020). mT5: A massively multilingual
pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
--
François PORTET
Professeur - Univ Grenoble Alpes
Laboratoire d'Informatique de Grenoble - Équipe GETALP
Bâtiment IMAG - Office 333
700 avenue Centrale
Domaine Universitaire - 38401 St Martin d'Hères
FRANCE
Phone: +33 (0)4 57 42 15 44
Email: [email protected]
www: http://membres-liglab.imag.fr/portet/