Hi everyone,

It is my pleasure to invite you to my Ph.D. defense next Thursday at 8:30 AM 
Paris time. It will take place at the Inria Paris Research Centre (2 rue Simone 
Iff, 75012 Paris) in the Jacques-Louis Lions room. It will be followed by a 
pot de thèse (reception) around 11 AM. 

The Ph.D. committee will be composed of: 
        • Yoav Goldberg, Bar-Ilan University, Rapporteur 
        • Lilja Øvrelid, University of Oslo, Rapportrice 
        • Benjamin Piwowarski, CNRS, Examinateur
        • Benoît Sagot, Inria Paris, Directeur
        • Natalie Schulter, IT University of Copenhagen, Examinatrice
        • Djamé Seddah, Inria Paris, Superviseur 


Abstract:

Deep Learning techniques applied to Natural Language Processing (NLP) have led 
to impressive empirical progress in recent years. In essence, this progress is 
due to the development of better-contextualized representations of textual data 
that can be easily used --- or transferred --- for a wide variety of NLP tasks. 
In their most recent and popular forms, these models consist of large-scale 
deep-learning language models, first pretrained on a large quantity of raw data 
and then adapted to specific tasks. These language models are now essential for 
search engines, question-answering pipelines, machine translation systems, etc.

However, these models usually require substantial computing power and large 
amounts of raw textual data. This makes the inherent diversity and variability 
of natural language a pressing challenge in NLP. Indeed, collecting large 
datasets for low-resource languages is challenging and costly, and training 
models from scratch for every domain and language is impractical. Additionally, 
the behavior of deep learning-based models is intrinsically difficult to 
understand, which makes the development of more cost-effective techniques even 
more challenging. 

For these reasons, we focus on the following question: "How can we make 
language models better at handling the variability and diversity of natural 
languages?"

As a starting step, we explore the generalizability of language models by 
building one of the first large-scale replications of a BERT model for a 
non-English language. We analyze the critical training ingredients and show 
that such a model can achieve state-of-the-art performance with only a few 
gigabytes of diverse data. 

Our results raise the question of using these language models on 
highly variable domains such as those found in user-generated content. Focusing 
on domain-gap reduction via lexical normalization, we show that this task can 
be addressed accurately with BERT-like models. However, it only partially 
improves downstream performance. 
Consequently, we focus on direct adaptation techniques using what we refer to 
as representation transfer and explore challenging settings such as the 
zero-shot setting, low-resource language varieties like Bambara or Uyghur, and 
highly variable, non-standardized code-mixed dialects such as a North African 
Arabic dialect written in the Latin script. We show that multilingual language 
models can be adapted and used efficiently for low-resource languages, even 
ones unseen during pretraining, and that the script is a critical component of 
this adaptation. 

NLP technologies are becoming increasingly critical to accessing knowledge, 
connecting with our friends, and extracting meaningful information from large 
quantities of text. In this thesis, we present concrete and usable solutions to 
ensure that we can build accurate NLP systems for the widest possible range of 
domains and languages at a reasonable cost. 


For those who cannot make it in person, a video-conference link will be shared 
in the following Google Doc:

https://docs.google.com/document/d/1YW1sa_x6oTLiXzgXoc2vcJqxaiZmhyu4TUGgl_DqMvk/edit?usp=sharing

Best,
Benjamin Muller
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]