[Corpora-List] Re: Lemmas and Lemmatization [was Re: NIF: NLP Interchange Format]

Hugh Paterson III via Corpora Wed, 25 Oct 2023 11:58:34 -0700

@Ada

>What do we do with students/graduates who were fed archaic ideals?
You give them full professorships. ;-)


- Hugh

On Wed, Oct 25, 2023 at 7:29 AM Ada Wan via Corpora <[email protected]>
wrote:

> Dear Christian, dear all [pls feel free to disregard if not interested]
>
> Discussion on a separate thread is fine.
>
> 1. Re "whether or not lemmatization is a valid NLP task":
> I must first clarify that I am not on this mailing list just for "NLP
> issues" (however "NLP" should be defined or regardless of whether it should
> exist as an area with "word"-based or non-general/generalizable methods
> beyond "machine/computational/automatic processing with (text) data").
> Machine processing with data, text/"language" or not, can be done without
> "words" anyway.
> I did not just question whether lemmatization is a valid NLP task. I
> questioned (the necessity/validity of) morphology in general, which would
> affect the practice of lemmatization.
>
> 2. Re "lemmas were not invented for NLP":
> It depends on what "lemmas" and "NLP" refer to. The study of
> morphology/morphs/morphemes certainly dates back quite some time (e.g.
> Panini --- in terms of decomposing a "word" into smaller parts). BUT the
> practice of naming segments as "lemmas" (and not "morphs"/"morphemes") and
> the use of the term "lemma" might have come from computational
> linguists/lexicographers and/or computing. Computing practices might have
> reinforced the practice of lemmatization/segmentation over the past
> decades, going back to the days (e.g. 1960s-1970s? [1]) when memory was
> more of an issue or when linguistic techniques were leveraged for
> computing with text.
>
> 3. Re "Bronze Age dictionaries/word lists of cuneiform languages":
> i. some of these are effects of interpretation (much of which dates back
> to the modern era, e.g. papyrology from the 19th century);
> ii. I do not argue against the possibility/practice of decomposition in
> general, but (linguistic) morphology is not a general decomposition
> approach, because it is based on a notion of "stemhood" that can be
> arbitrary, indeterminate, "culture"/context-specific, and/or idiosyncratic
> (recall the many hard-to-decipher symbols/graphemes in many ancient
> manuscripts). A more general method would be to decompose at a granularity
> that is fine enough and recompose based on frequency (as that's also often
> a pivotal criterion for empirical analyses and interpretations).
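
A minimal sketch of the "decompose finely, recompose based on frequency" idea in point 3.ii, assuming it resembles byte-pair-encoding-style merging. The message does not name an algorithm, so the function names (`most_frequent_pair`, `merge_pair`, `recompose`) and the choice of single characters as the starting granularity are illustrative assumptions, not the author's method.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent pair of units, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with one merged unit."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def recompose(texts, num_merges):
    """Split each text into characters, then merge the most frequent pair repeatedly."""
    sequences = [list(t) for t in texts]  # finest granularity assumed here: characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        sequences = [merge_pair(s, pair) for s in sequences]
    return sequences, merges
```

Calling `recompose(["abab", "ab"], 1)` merges the pair `("a", "b")` first, since it is the most frequent adjacent pair. No notion of stem or lemma is involved at any step; the units that emerge depend only on the data and the number of merges.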
>
> 4. Re "the use of head words in dictionaries is a practice that won't go
> away as long as people are going to use dictionaries ... for language
> learning":
> many lexical resources (e.g. dictionaries) are based on character n-grams
> and do not leverage the notion of "head words". The notion of "morphology"
> is hence orthogonal/irrelevant.
> (Remark:
> a few decades ago, it might have been much easier in some parts of the
> world to get clarity on this --- just by walking into a bookstore or a
> library and looking at the plethora of lexical resources --- of different
> types/formats/designs, in general or for particular disciplines. But that
> practice seems to have (almost) become a lost art now.
> For "language learning", I'd recommend the immersion method. Nothing beats
> experiencing communication in multi-dimensional, full-bodied contexts. Use
> lexical resources only as mnemonics of sorts (don't become too
> pedantic/addicted to such). Use style guides (or "grammar textbooks") only
> when pleasing others is necessary. :) Just thought to note to those on this
> list who might be interested.)
>
> 5. Re "inflection patterns": please see my reply to Orhan earlier today:
> [tweet/x]
> The solution can be adapted for "(morphological) segmentation" as well.
> Please let me know if it is fine with you or if you have any objections.
>
> 6. Re "low resource ... just plain legacy word lists and grammar sketches":
> if one works in data collection: for varieties that are still alive, one
> should record raw and full data when possible and retire the ("colonial")
> practice of elicitation based on "words". One could also try to obtain
> parallel data in larger spans instead. For varieties that are extinct, one
> archives what one has.
> For what purposes should any "word"-based practices or linguistic
> morphology be involved?
>
> 7. "won't go away in corpus linguistics and the philologies":
> Be it for corpus linguistics, the philologies, the humanities and the
> social sciences --- digital or not, for "practical" purposes or not,
> everything (methods, approaches, interpretations, reception... etc.) can be
> updated.
>
> 8. Re "[w]hether or not the use of lemmas ... is a valid task depends on
> the use case" and "data modeling":
> sure, the use of tools can depend on the purpose of the task. But the
> issue here is: if the use of lemmas is only good for the task of
> lemmatization, and if the use of lemmatization with text data is only good
> for linguistic morphology, and when morphology is found not (or no longer)
> relevant/useful/correct/appropriate, what do we do with a curriculum that
> overfits on one representation granularity that does not have a solid
> foundation? What do we do with students/graduates who were fed archaic
> ideals?
>
> Best
> Ada
> (Some often forget that I am also a linguist, not just a "computational
> person", among other roles/interests.)
>
> [1] My dating references here are supported by: "Algorithms for stemming
> have been studied in computer science since the 1960s." and "The first
> published stemmer was written by Julie Beth Lovins in 1968." (
> https://en.wikipedia.org/wiki/Stemming)
> I would've guessed from the 1950s otherwise....
>
> On Wed, Oct 18, 2023 at 9:05 AM Christian Chiarcos via Corpora <
> [email protected]> wrote:
>
>> Dear Ada, dear all,
>>
>> I think it's necessary to discuss this in a separate thread. As for Hugh,
>> he had a practical problem with an existing data set and we can discuss
>> specific solutions for that. As for Ada, whether or not lemmatization is a
>> valid NLP task can be discussed, as well, but this has absolutely nothing
>> to do with the very concrete request for advice on a real problem at hand.
>>
>> I really don't want to dive into this, but focus on the first part. Of
>> course, there are applications where lemmatization as an NLP task was
>> assumed to be necessary but is no longer needed. But lemmas were not
>> invented for NLP; they were invented for structuring dictionaries and
>> describing morphology several millennia before the computer (I'm
>> thinking of Bronze Age dictionaries/word lists of cuneiform languages here,
>> used for teaching Sumerian, but even in our 3rd-millennium BCE Sumerian
>> cuneiform corpus, from the time when it was still spoken, there was a notion
>> of lemma or head word, and scribes sometimes just wrote that because they
>> were too lazy to write the full morphology). And the use of head words in
>> dictionaries is a practice that won't go away as long as people are going
>> to use dictionaries (be they digital or not) for language learning. And
>> that's equally true for writing textbook grammars and for teaching
>> morphology (you need some kind of base form to describe your inflection
>> patterns), as it is for rule-based morphology (that won't go away, either,
>> even though the use case is more on the low resource side of things ... low
>> resource meaning few corpus data, no parallel data, just plain legacy word
>> lists and grammar sketches). And also, it won't go away in corpus
>> linguistics and the philologies, at least not for use cases where people
>> come from a dictionary perspective.
>>
>> Whether or not the use of lemmas (note that the question was actually not
>> about lemmatization, but about data modelling) is a valid task depends on
>> the use case. Working with humanists that want that because it's their
>> established practice is a valid use case. We can debate with them, of
>> course, but they are the experts on their use case, and I'd prefer to
>> devote my energy to something more practically relevant, like getting them
>> away from using MS Office for annotations or dictionaries and toward using
>> any tool that produces structured output instead. And already this can be a
>> hard problem that might eventually kill an otherwise interesting project.
>> (Apologies, that's not true of everyone, of course, but those cases exist,
>> and even where people understand the necessity, we still have to work with
>> decades of legacy data to bring into shape.) As for the role of
>> lemmatization in NLP, please continue to discuss without me.
>>
>> @Ada, you seem to have a very concrete idea in mind how to get humanists
>> away from getting lemmas. I guess that could be an interesting discussion
>> at a conference on DH or language learning -- because this is where the
>> requirement comes from.
>>
>> Best,
>> Christian
>>
>> Am Di., 17. Okt. 2023 um 19:45 Uhr schrieb Bilgin, Orhan (Postgraduate
>> Researcher) via Corpora <[email protected]>:
>>
>>> Dear Ada,
>>>
>>> I agree that lemmatisation is a construct and is not a universal method
>>> for linguistic analyses, but I don't understand why it is imperative that I
>>> wean myself from using lemmas.
>>>
>>> What is it that restricts my freedom to invent the lemma (a
>>> non-universal construct) AĞAÇ-, for example, to refer to the one and only
>>> "meaningful thing" that is common to the very many (theoretically infinite,
>>> practically probably around 10,000) strings including ağaç, ağacı, ağaca,
>>> ağaçlar, ağacımızdaki, ağaçlandırılabilmesinden, ağaçsızlaşmasını, etc.
>>> etc.? How (and why) am I supposed to talk about that very large set without
>>> using a label for it?
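
A minimal sketch of the point above, treating the lemma purely as a label for a set of surface strings. The mapping holds only a few of the forms listed in the message, and the names `lemma_to_forms`, `form_to_lemma` and `lemma_of` are hypothetical, chosen for illustration.

```python
# One label, many surface forms (a small sample of the forms listed above).
lemma_to_forms = {
    "AĞAÇ-": {"ağaç", "ağacı", "ağaca", "ağaçlar", "ağacımızdaki"},
}

# Inverted index: look up the label for any known surface form.
form_to_lemma = {
    form: lemma
    for lemma, forms in lemma_to_forms.items()
    for form in forms
}


def lemma_of(form):
    """Return the label for a known surface form, or None if it is not listed."""
    return form_to_lemma.get(form)
```

Note that `"ağacı".startswith("ağaç")` is false because of the ç/c alternation visible in the listed forms, so a shared-prefix test on the raw strings would not recover this set. The label is what names it.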
>>>
>>> Best,
>>>
>>> Orhan Bilgin
>>>
>>>
>>> On 17 Oct 2023 18:36, Ada Wan via Corpora <[email protected]>
>>> wrote:
>>>
>>> Dear Christian
>>>
>>> Re your PS:
>>> one doesn't need to debate the use/future of lemmatization, though I'd
>>> welcome such as part of scholarship. For those experienced in matters in/of
>>> Linguistics, it should be clear that lemmatization was simply a construct,
>>> an entry-level philological exercise (esp. for those from Computer Science
>>> with less of a background in Linguistics and language(s)). It has been sad
>>> that some have picked up the habit of using lemmatization as a heuristic
>>> (though for what, specifically?) and might have become, apparently, too
>>> addicted to it to let it go. It is imperative that one wean oneself
>>> from such a habit.
>>> Methods for linguistic morphology, e.g. (morphological) parsing or
>>> stemming, are neither a universal decomposition scheme nor a universal
>>> method for language/linguistic analyses. It is also important to bear in
>>> mind that neither linguistic morphology nor lemmas/lemmata have that long
>>> of a history.
>>>
>>> Thanks for being open-minded enough to read this far.
>>>
>>> Best
>>> Ada
>>>
>>>
>>> _______________________________________________
>>> Corpora mailing list -- [email protected]
>>> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
>>> To unsubscribe send an email to [email protected]
>>>
