RESEARCH INTERNSHIP
*Quantifying diversity of language phenomena: Case
study of multiword expressions* (LIFAT, Blois, France)
We are offering a master's internship position in
Blois (France). To apply, please send an email with
a CV, a transcript of your bachelor's and master's
grades, and a few lines explaining your motivation
to Arnaud Soulet <[email protected]>, as well as
Agata Savary and Thomas Lavergne
<[email protected]>.
Internship proposal description:
https://selexini.lis-lab.fr/jobs/2022/11/26/internship
Application deadline: *December 8*, 2022 (or until
filled)
------
MOTIVATION AND CONTEXT
*Diversity* of naturally occurring phenomena is a
vital heritage to be preserved in the current
progress- and optimization-driven globalization
era. Diversity has been quantified in many
domains (ecology, economics, information science,
etc.), but less so in *Natural Language Processing*
(NLP). Recently, we have been addressing this
aspect with respect to a particular linguistic
phenomenon: that of *multiword expressions*
(MWEs).
MWEs, such as (FR) /casser sa pipe/ 'to die'
(literally 'to break one's pipe') or (FR) /sortir du
lot/ 'to be better than others' (literally 'to quit
the batch'), are groups of words which exhibit
unexpected properties (Baldwin & Kim, 2010;
Constant et al., 2017). Most prominently, their
meaning does not straightforwardly derive from the
meanings of their components. Language resources
dedicated to MWEs include MWE lexicons and
MWE-annotated corpora (Savary et al., 2017), while
a major computational task is to *automatically
identify MWEs* in running text. The PARSEME
network has been addressing the MWE identification
task via a series of *shared tasks* on automatic
identification of verbal MWEs (Ramisch et al.,
2020). Our recent work (Lion-Bouton, 2021;
Lion-Bouton et al., 2022) is explicitly dedicated
to *quantifying diversity in MWE language
resources and MWE identification systems*. We have
adapted measures of *variety* (number of types in
a system), *balance* (equity of items in various
types) and *disparity* (differences between
types), stemming notably from ecology and
information theory (Morales, 2021).
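
As an illustration only, two of these measures can be
sketched on toy counts. The formulas below are an
assumption, not necessarily those used in the cited
work: variety is taken as the number of observed
types, and balance as Pielou's evenness (Shannon
entropy divided by its maximum). Disparity is left
out because it requires a distance between types.
The function names and the counts are hypothetical.

```python
import math

def variety(counts):
    """Number of distinct types with at least one occurrence."""
    return sum(1 for c in counts.values() if c > 0)

def balance(counts):
    """Pielou's evenness: Shannon entropy divided by log(variety).

    Assumption: this is one common way to measure balance;
    the announcement does not say which formula is used.
    """
    total = sum(counts.values())
    n_types = variety(counts)
    if n_types <= 1:
        # Convention chosen here: evenness is set to 0.0
        # when there are fewer than two types.
        return 0.0
    entropy = -sum(
        (c / total) * math.log(c / total)
        for c in counts.values()
        if c > 0
    )
    return entropy / math.log(n_types)

# Toy, made-up occurrence counts (not real corpus data).
toy_counts = {"casser sa pipe": 4, "sortir du lot": 4}

print(variety(toy_counts))
print(balance(toy_counts))
```

With perfectly even counts, as in this toy example,
the balance reaches its maximum; skewed counts give
a lower value.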
------
OBJECTIVE
The objective of this internship is to extend the
formalisation of diversity by drawing on
*Good-Turing frequency estimation*. Successfully
used to estimate the biomass, Good-Turing
frequency estimation is a statistical technique
for estimating the probability of encountering an
object of an unseen species, given a set of past
observations of objects from different species
(Good, 1953). Following the same principle, the idea
is to *estimate the number of unseen MWEs
from the MWEs observed* in the corpus. Thus, it
will be possible to correct the diversity measures
to take the unseen MWEs into account and to
evaluate the possible selection bias of the corpus.
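
To make the principle concrete, here is a minimal
sketch on made-up counts. It is not the method to be
developed during the internship. The first function
is the basic Good-Turing estimate of the probability
that the next observation belongs to an unseen type,
N1 / N, where N1 is the number of types seen exactly
once and N the total number of observations. The
second function is a different, swapped-in technique:
the bias-corrected Chao1 estimator of the number of
unseen types, N1 (N1 - 1) / (2 (N2 + 1)), where N2 is
the number of types seen exactly twice. All names and
counts are hypothetical.

```python
from collections import Counter

def frequency_of_frequencies(counts):
    """Map each frequency r to the number of types seen exactly r times."""
    return Counter(c for c in counts.values() if c > 0)

def good_turing_unseen_mass(counts):
    """Basic Good-Turing estimate of the unseen probability mass: N1 / N."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    n1 = frequency_of_frequencies(counts)[1]
    return n1 / total

def chao1_unseen_types(counts):
    """Bias-corrected Chao1 estimate of the number of unseen types.

    Assumption: Chao1 is used here only as one possible
    estimator; it is not named in the announcement.
    """
    fof = frequency_of_frequencies(counts)
    n1, n2 = fof[1], fof[2]
    return n1 * (n1 - 1) / (2 * (n2 + 1))

# Toy, made-up counts for hypothetical MWE types.
toy = {"mwe_a": 5, "mwe_b": 2, "mwe_c": 2,
       "mwe_d": 1, "mwe_e": 1, "mwe_f": 1}

print(good_turing_unseen_mass(toy))  # 3 singletons / 12 observations
print(chao1_unseen_types(toy))       # 3 * 2 / (2 * 3)
```

A corrected variety could then be approximated as the
observed number of types plus the estimated number of
unseen types, under the assumptions stated above.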