Hi everyone,
It is my pleasure to invite you to my Ph.D. defense next Thursday at 8:30 AM
Paris time. It will take place at the Inria Paris Research Centre (2 rue Simone
Iff, 75012 Paris), in the Jacques-Louis Lions room. It will be followed by a
reception (pot de thèse) around 11 AM.
The Ph.D. committee will be composed of:
• Yoav Goldberg, Bar-Ilan University, Reviewer
• Lilja Øvrelid, University of Oslo, Reviewer
• Benjamin Piwowarski, CNRS, Examiner
• Benoît Sagot, Inria Paris, Advisor
• Natalie Schluter, IT University of Copenhagen, Examiner
• Djamé Seddah, Inria Paris, Supervisor
Abstract:
Deep learning techniques applied to Natural Language Processing (NLP) have led
to impressive empirical progress in recent years. In essence, this progress is
due to the development of better contextualized representations of textual data
that can easily be used, or transferred, for a wide variety of NLP tasks.
In their most recent and popular forms, these models consist of large-scale
deep-learning language models, first pretrained on a large quantity of raw data
and then adapted to specific tasks. These language models are now essential for
search engines, question-answering pipelines, machine translation systems, etc.
However, these models usually require substantial computing power and large
amounts of raw textual data. This makes the inherent diversity and variability
of natural language a pressing challenge in NLP: collecting large datasets for
low-resource languages is challenging and costly, and training models from
scratch for every domain and language is unreasonable in practice.
Additionally, understanding the behavior of deep learning-based models is
intrinsically difficult, making the development of more cost-effective
techniques even more challenging.
For these reasons, we focus on the following question: "How can we make
language models better at handling the variability and diversity of natural
languages?"
As a first step, we explore the generalizability of language models by
building one of the first large-scale replications of a BERT model for a
non-English language. We analyze the critical training ingredients and show
that such a model can achieve state-of-the-art performance with only a few
gigabytes of diverse data.
Our results raise the question of whether these language models can be used on
highly variable domains such as those found in user-generated content. Focusing
on domain-gap reduction via lexical normalization, we show that this task can
be addressed accurately with BERT-like models. However, we find that it only
partially improves downstream performance.
Consequently, we focus on direct adaptation techniques using what we refer to
as representation transfer, and we explore challenging settings such as the
zero-shot setting, low-resource language varieties like Bambara or Uyghur, and
highly variable, non-standardized code-mixed dialects such as a North-African
Arabic dialect written in the Latin script. We show that multilingual language
models can be adapted to and used efficiently for low-resource languages, even
ones unseen during pretraining, and that the script is a critical component of
this adaptation.
NLP technologies are becoming increasingly critical to accessing knowledge,
connecting with our friends, and extracting meaningful information from large
quantities of text. In this thesis, we present concrete and usable solutions to
ensure that we can build accurate NLP systems for as many domains and languages
as possible at a reasonable cost.
For those who cannot make it in person, a video-conference link will be shared
in the following Google Doc:
https://docs.google.com/document/d/1YW1sa_x6oTLiXzgXoc2vcJqxaiZmhyu4TUGgl_DqMvk/edit?usp=sharing
Best,
Benjamin Muller