@Ada

> What do we do with students/graduates who were fed archaic ideals?

You give them full professorships. ;-)
- Hugh

On Wed, Oct 25, 2023 at 7:29 AM Ada Wan via Corpora <[email protected]> wrote:

> Dear Christian, dear all [pls feel free to disregard if not interested]
>
> Discussion on a separate thread is fine.
>
> 1. Re "whether or not lemmatization is a valid NLP task":
> I must first clarify that I am not on this mailing list just for "NLP
> issues" (however "NLP" should be defined or regardless of whether it should
> exist as an area with "word"-based or non-general/generalizable methods
> beyond "machine/computational/automatic processing with (text) data").
> Machine processing with data, text/"language" or not, can be done without
> "words" anyway.
> I did not just question whether lemmatization is a valid NLP task. I
> questioned (the necessity/validity of) morphology in general, which would
> affect the practice of lemmatization.
>
> 2. Re "lemmas were not invented for NLP":
> it depends on what "lemmas" and "NLP" refer to. The study of
> morphology/morphs/morphemes certainly dates back quite some time (e.g.
> Panini --- in terms of decomposing a "word" into smaller parts). BUT the
> practice of naming segments as "lemmas" (and not "morphs"/"morphemes") and
> the use of the term "lemma" might have come from computational
> linguists/lexicographers and/or computing. Computing practices might have
> reinforced the practice of lemmatization/segmentation throughout the past
> decades, since back in the day (e.g. 1960s-1970s? [1]) when memory was
> more of an issue or when linguistic techniques were leveraged when
> computing with text.
>
> 3. Re "Bronze Age dictionaries/word lists of cuneiform languages":
> i. some of these are effects of interpretation (much of which dates back
> to the modern era, e.g. papyrology from the 19th century);
> ii.
> I do not argue against the possibility/practice of decomposition in
> general, but (linguistic) morphology is not a general decomposition
> approach, because it is based on a notion of "stemhood" that can be
> arbitrary, indeterminate, "culture"/context-specific, and/or idiosyncratic
> (recall many hard-to-decipher symbols/graphemes on many ancient
> manuscripts). A more general method would be to decompose at a granularity
> that is fine enough and recompose based on frequency (as that's also often
> a pivotal criterion for empirical analyses and interpretations).
>
> 4. Re "the use of head words in dictionaries is a practice that won't go
> away as long as people are going to use dictionaries ... for language
> learning":
> many lexical resources (e.g. dictionaries) are based on character n-grams
> and do not leverage the notion of "head words". The notion of "morphology"
> is hence orthogonal/irrelevant.
> (Remark:
> a few decades ago, it might have been much easier in some parts of the
> world to get clarity on this --- just by walking into a bookstore or a
> library and looking at the plethora of lexical resources --- of different
> types/formats/designs, in general or for particular disciplines. But that
> practice seems to have (almost) become a lost art now.
> For "language learning", I'd recommend the immersion method. Nothing beats
> experiencing communication in multi-dimensional, full-bodied contexts. Use
> lexical resources only as mnemonics of sorts (don't become too
> pedantic about/addicted to such). Use style guides (or "grammar textbooks")
> only when pleasing others is necessary. :) Just thought to note to those on
> this list who might be interested.)
>
> 5. Re "inflection patterns": please see my reply to Orhan earlier today:
> [tweet/x]
> The solution can be adapted for "(morphological) segmentation" as well.
> Please let me know if it is fine with you or if you have any objections.
>
> 6. Re "low resource ...
> just plain legacy word lists and grammar sketches":
> if one works in data collection: for varieties that are still alive, one
> should record raw and full data when possible and retire the ("colonial")
> practice of elicitation based on "words". One could also try to obtain
> parallel data in larger spans instead. For varieties that are extinct, one
> archives what one has.
> For what purposes should any "word"-based practices or linguistic
> morphology be involved?
>
> 7. Re "won't go away in corpus linguistics and the philologies":
> Be it for corpus linguistics, the philologies, the humanities and the
> social sciences --- digital or not, for "practical" purposes or not,
> everything (methods, approaches, interpretations, reception... etc.) can be
> updated.
>
> 8. Re "[w]hether or not the use of lemmas ... is a valid task depends on
> the use case" and "data modeling":
> sure, the use of tools can depend on the purpose of the task. But the
> issue here is: if the use of lemmas is only good for the task of
> lemmatization, and if the use of lemmatization with text data is only good
> for linguistic morphology, and when morphology is found not (or no longer)
> relevant/useful/correct/appropriate, what do we do with a curriculum that
> overfits on one representation granularity that does not have a solid
> foundation? What do we do with students/graduates who were fed archaic
> ideals?
>
> Best
> Ada
> (Some often forget that I am also a linguist, not just a "computational
> person", among other roles/interests.)
>
> [1] My dating references here are supported by: "Algorithms for stemming
> have been studied in computer science since the 1960s." and "The first
> published stemmer was written by Julie Beth Lovins in 1968."
> (https://en.wikipedia.org/wiki/Stemming)
> I would've guessed from the 1950s otherwise....
>
> On Wed, Oct 18, 2023 at 9:05 AM Christian Chiarcos via Corpora <[email protected]> wrote:
>
>> Dear Ada, dear all,
>>
>> I think it's necessary to discuss this in a separate thread. As for Hugh,
>> he had a practical problem with an existing data set and we can discuss
>> specific solutions for that. As for Ada, whether or not lemmatization is a
>> valid NLP task can be discussed, as well, but this has absolutely nothing
>> to do with the very concrete request for advice on a real problem at hand.
>>
>> I really don't want to dive into this, but focus on the first part. Of
>> course, there are applications where lemmatization as an NLP task was
>> assumed to be necessary but is no longer needed. But lemmas were not
>> invented for NLP, they were invented for structuring dictionaries and
>> describing morphology actually several millennia before the computer (I'm
>> thinking of Bronze Age dictionaries/word lists of cuneiform languages here,
>> used for teaching Sumerian, but even in our 3rd millennium BCE Sumerian
>> cuneiform corpus from the time when it was still spoken, there was a notion
>> of lemma or head word, and scribes sometimes just wrote that because they
>> were too lazy to write the full morphology). And the use of head words in
>> dictionaries is a practice that won't go away as long as people are going
>> to use dictionaries (be they digital or not) for language learning. And
>> that's equally true for writing textbook grammars and for teaching
>> morphology (you need some kind of base form to describe your inflection
>> patterns), as it is for rule-based morphology (that won't go away, either,
>> even though the use case is more on the low resource side of things ... low
>> resource meaning few corpus data, no parallel data, just plain legacy word
>> lists and grammar sketches). And also, it won't go away in corpus
>> linguistics and the philologies, at least not for use cases where people
>> come from a dictionary perspective.
>>
>> Whether or not the use of lemmas (note that the question was actually not
>> about lemmatization, but about data modelling) is a valid task depends on
>> the use case. Working with humanists that want that because it's their
>> established practice is a valid use case. We can debate with them, of
>> course, but they are the experts on their use case, and I'd prefer to
>> devote my energy to something more practically relevant, like getting them
>> away from using MS Office for annotations or dictionaries and to use any
>> tool that produces structured output, instead. And already this can be a
>> hard problem that might eventually kill an otherwise interesting project.
>> (Apologies, that's not true of everyone, of course, but those cases exist,
>> and even where people understand the necessity, we still have to work with
>> decades of legacy data to bring into shape.) As for the role of
>> lemmatization in NLP, please continue to discuss without me.
>>
>> @Ada, you seem to have a very concrete idea in mind how to get humanists
>> away from getting lemmas. I guess that could be an interesting discussion
>> at a conference on DH or language learning -- because this is where the
>> requirement comes from.
>>
>> Best,
>> Christian
>>
>> On Tue, Oct 17, 2023 at 19:45, Bilgin, Orhan (Postgraduate
>> Researcher) via Corpora <[email protected]> wrote:
>>
>>> Dear Ada,
>>>
>>> I agree that lemmatisation is a construct and is not a universal method
>>> for linguistic analyses, but I don't understand why it is imperative that I
>>> wean myself from using lemmas.
>>>
>>> What is it that restricts my freedom to invent the lemma (a
>>> non-universal construct) AĞAÇ-, for example, to refer to the one and only
>>> "meaningful thing" that is common to the very many (theoretically infinite,
>>> practically probably around 10,000) strings including ağaç, ağacı, ağaca,
>>> ağaçlar, ağacımızdaki, ağaçlandırılabilmesinden, ağaçsızlaşmasını, etc.
>>> etc.?
>>> How (and why) am I supposed to talk about that very large set without
>>> using a label for it?
>>>
>>> Best,
>>>
>>> Orhan Bilgin
>>>
>>>
>>> On 17 Oct 2023 18:36, Ada Wan via Corpora <[email protected]>
>>> wrote:
>>>
>>> Dear Christian
>>>
>>> Re your PS:
>>> one doesn't need to debate the use/future of lemmatization, though I'd
>>> welcome such as part of scholarship. For those experienced in matters in/of
>>> Linguistics, it should be clear that lemmatization was simply a construct,
>>> an entry-level philological exercise (esp. for those from Computer Science
>>> with less of a background in Linguistics and language(s)). It has been sad
>>> that some have picked up the habit of using lemmatization as a heuristic
>>> (though for what, specifically?) and might have become, apparently, too
>>> addicted to it to let it go. It is imperative that one wean oneself
>>> from such a habit.
>>> Methods for linguistic morphology, e.g. (morphological) parsing or
>>> stemming, are not a universal decomposition scheme, nor a universal method
>>> for language/linguistic analyses. Also important to bear in mind is that
>>> neither linguistic morphology nor lemmas/lemmata have that
>>> long of a history.
>>>
>>> Thanks for being open-minded enough to read this far.
>>>
>>> Best
>>> Ada
>>>
>>> _______________________________________________
>>> Corpora mailing list -- [email protected]
>>> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
>>> To unsubscribe send an email to [email protected]
