[Corpora-List] Lemmas and Lemmatization [was Re: NIF: NLP Interchange Format]

Christian Chiarcos via Corpora Wed, 18 Oct 2023 00:05:27 -0700

Dear Ada, dear all,

I think it's necessary to discuss this in a separate thread. As for Hugh,
he had a practical problem with an existing data set and we can discuss
specific solutions for that. As for Ada, whether or not lemmatization is a
valid NLP task can be discussed, as well, but this has absolutely nothing
to do with the very concrete request for advice on a real problem at hand.

I really don't want to dive into this, but focus on the first part. Of
course, there are applications where lemmatization as an NLP task was
assumed to be necessary but is no longer needed. But lemmas were not
invented for NLP, they were invented for structuring dictionaries and
describing morphology, actually several millennia before the computer (I'm
thinking of Bronze Age dictionaries/word lists of cuneiform languages here,
used for teaching Sumerian, but even in our 3rd millennium BCE Sumerian
cuneiform corpus, from the time when it was still spoken, there was a notion
of lemma or head word, and scribes sometimes just wrote that because they
were too lazy to write the full morphology). And the use of head words in
dictionaries is a practice that won't go away as long as people are going
to use dictionaries (be they digital or not) for language learning. And
that's equally true for writing textbook grammars and for teaching
morphology (you need some kind of base form to describe your inflection
patterns), as it is for rule-based morphology (that won't go away, either,
even though the use case is more on the low resource side of things ... low
resource meaning few corpus data, no parallel data, just plain legacy word
lists and grammar sketches). And also, it won't go away in corpus
linguistics and the philologies, at least not for use cases where people
come from a dictionary perspective.

Whether or not the use of lemmas (note that the question was actually not
about lemmatization, but about data modelling) is a valid task depends on
the use case. Working with humanists who want lemmas because that is their
established practice is a valid use case. We can debate with them, of
course, but they are the experts on their use case, and I'd prefer to
devote my energy to something more practically relevant, like getting them
away from using MS Office for annotations or dictionaries and towards any
tool that produces structured output instead. And this alone can be a
hard problem that might eventually kill an otherwise interesting project.
(Apologies, that's not true of everyone, of course, but those cases exist,
and even where people understand the necessity, we still have to work with
decades of legacy data that has to be brought into shape.) As for the role of
lemmatization in NLP, please continue to discuss without me.

@Ada, you seem to have a very concrete idea in mind of how to get humanists
away from using lemmas. I guess that could be an interesting discussion
at a conference on DH or language learning -- because this is where the
requirement comes from.

Best,
Christian

On Tue, 17 Oct 2023 at 19:45, Bilgin, Orhan (Postgraduate
Researcher) via Corpora <[email protected]> wrote:

> Dear Ada,
>
> I agree that lemmatisation is a construct and is not a universal method
> for linguistic analyses, but I don't understand why it is imperative that I
> wean myself from using lemmas.
>
> What is it that restricts my freedom to invent the lemma (a non-universal
> construct) AĞAÇ-, for example, to refer to the one and only "meaningful
> thing" that is common to the very many (theoretically infinite, practically
> probably around 10,000) strings including ağaç, ağacı, ağaca, ağaçlar,
> ağacımızdaki, ağaçlandırılabilmesinden, ağaçsızlaşmasını, etc. etc.? How
> (and why) am I supposed to talk about that very large set without using a
> label for it?
>
> Best,
>
> Orhan Bilgin
>
>
> On 17 Oct 2023 18:36, Ada Wan via Corpora <[email protected]> wrote:
>
> Dear Christian
>
> Re your PS:
> one doesn't need to debate the use/future of lemmatization, though I'd
> welcome such as part of scholarship. For those experienced in matters in/of
> Linguistics, it should be clear that lemmatization was simply a construct,
> an entry-level philological exercise (esp. for those from Computer Science
> with less of a background in Linguistics and language(s)). It has been sad
> that some have picked up the habit of using lemmatization as a heuristic
> (though for what, specifically?) and might have become, apparently, too
> addicted to it to let it go. It is imperative that one wean oneself
> from such a habit.
> Methods for linguistic morphology, e.g. (morphological) parsing or
> stemming, are not a universal decomposition scheme, nor a universal method
> for language/linguistic analyses. It is also important to bear in mind that
> neither linguistic morphology nor lemmas/lemmata have that long a
> history.
>
> Thanks for being open-minded enough to read this far.
>
> Best
> Ada
>
>
> _______________________________________________
> Corpora mailing list -- [email protected]
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to [email protected]
>
