[Corpora-List] Re: Lemmas and Lemmatization [was Re: NIF: NLP Interchange Format]

Gilles Sérasset via Corpora Wed, 01 Nov 2023 02:34:44 -0700

Hi Ada,

As these threads consist of discussion rather than a set of scientific 
statements (the former being a response to a stimulus, while the latter 
consists in defining and motivating a scientific position that is supposed 
to stand apart from any specific discussion), I forbid you to use any of my 
writings posted on the corpora list on any of your web sites.


Of course, I still authorise the corpora list to keep archives (as these are 
maintained along with the full discussion context).

Regards,

Gilles Sérasset

> On 31 Oct 2023, at 19:19, Ada Wan <[email protected]> wrote:
> 
> Dear all
> 
> I am about to post CorporaList threads which I have responded to on my own 
> website, as it seems some of my replies are not yet showing on the public 
> website 
> (https://list.elra.info/mailman3/hyperkitty/list/[email protected]/). 
> If any of you should have any objections to this (because you don't want your 
> replies to be seen), please let me know asap. 
> 
> Thanks and best
> Ada
> 
> 
> On Mon, Oct 30, 2023 at 9:31 PM Ada Wan <[email protected]> wrote:
> [Disregard if not interested]
> 
> Dear all
> 
> Thanks for your emails. The issue of where the misunderstanding might lie is 
> clearer to me now, esp. given Gilles' example with his niece. 
> (@Anil: perhaps you are right in your observation of a possible style change 
> in my correspondence --- I may well have been running out of patience at 
> this point, considering I have been in rebuttal mode since at least 2019 
> [1]! So it's a good thing that morphology is coming to an end! In the 
> beginning, I had expected the professionals whom I expect to be experienced 
> in "language"/data matters (and the subscribers of the CorporaList) to be the 
> first to appreciate my results, but it turned out to be the other way around, 
> it seems. Those who have been exposed to fewer "language tales" [2] can be 
> quicker in getting it. But anyway, please allow me to explain again below.)
> 
> Most importantly, in the niece example, there are 2 things that should be 
> discerned from one another: 
> i. what the niece uttered [i.e. data/observation (do note also how the data 
> is collected: recorded or transcribed?)], and 
> ii. what one's interpretation/analysis of her utterance is [i.e. 
> interpretation/analysis of observation]. 
> 
> In "grammarese" formulation, the case in question is as follows: Gilles' 
> niece conjugated an irregular verb with a regular verb conjugation 
> pattern.[3] 
> 
> Gilles suspects that (linguistic) morphology exists (and/or is universal?) 
> because the pattern of the niece's utterance resembled one of the patterns 
> (sometimes formulated from "rules" [4]) often studied in literature on 
> morphology. 
> 
> Re "she clearly showed me that her way of learning languages did not 
> consist in reading/listening to huge amounts of utterances ...":
> even if the niece had only been exposed to 10 utterances, of which 8 
> exhibit a certain pattern and 2 are more irregular/outlier-like, the 
> chances of her applying the more frequently observed pattern to the 
> rarer/unobserved cases can be high --- and would you not agree that's 
> rather reasonable?
> There are or may be un-/subconscious *patterns*, sure. But I do not argue 
> against these, for such patterns do not have to be formulated in terms of 
> "stems"/"roots"/"affixes", and more importantly, most of these patterns 
> surface more often in books than in real life anyway. So the fact that one 
> believes that a morphological paradigm is to be formulated in a certain way 
> is pretty much a matter of preference of a (group of) researcher(s).  
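The 8-of-10 scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten verb pairs are our own toy sample of French third-person-singular futures (not data from the thread), and the helper names `edit_pattern`, `most_frequent_pattern` and `apply_pattern` are hypothetical. The sketch counts string-level edits between two observed forms and applies the most frequent one to an unseen item; it makes no claim about how a child actually learns.

```python
from collections import Counter

def edit_pattern(base, form):
    """Describe `form` as (ending stripped from `base`, ending appended)."""
    i = 0
    while i < min(len(base), len(form)) and base[i] == form[i]:
        i += 1
    return (base[i:], form[i:])

def most_frequent_pattern(pairs):
    """Return the most common edit pattern and how often it was seen."""
    counts = Counter(edit_pattern(base, form) for base, form in pairs)
    return counts.most_common(1)[0]

def apply_pattern(base, pattern):
    """Apply a (strip, add) pattern to an unseen form, or None if it does not fit."""
    strip, add = pattern
    if strip and not base.endswith(strip):
        return None
    stem = base[:len(base) - len(strip)] if strip else base
    return stem + add

# Toy exposure (an assumption): 8 forms share one pattern, 2 are outliers.
observed = [
    ("parler", "parlera"), ("finir", "finira"),
    ("chanter", "chantera"), ("dormir", "dormira"),
    ("manger", "mangera"), ("partir", "partira"),
    ("donner", "donnera"), ("sortir", "sortira"),
    ("aller", "ira"), ("être", "sera"),
]

pattern, seen = most_frequent_pattern(observed)
predicted = apply_pattern("venir", pattern)
print(pattern, seen, predicted)
```

With this sample the dominant pattern is "append a", seen 8 times, and applying it to the unseen "venir" yields "venira", the form reported in the niece example. Whether such a pattern is best described as a rule or as a frequency effect is exactly what the thread is debating; the code is neutral on that.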
> 
> Re "but she was able to learn some word formation rules from very few 
> examples": 
> what she "learned" might just be some patterns --- at least according to 
> your/our analysis here. That is, she might not have yet had much exposure to 
> "rules", but Gilles might have. (Hence his conviction of the reality of 
> morphology may be stronger.)
> 
> Re "In my humble opinion, this proves that morphology exists, if not in the 
> LLM matrixes, at least in the human brain": 
> I don't disagree with how one's mind can be clouded by archaic ideals or 
> theories. But shouldn't a better theory exist outside of the mind of a person 
> or a group of scientists as well? 
> 
> If one accounts for text data in its entirety, i.e. without disregarding or 
> adding in whitespaces, and evaluates over bigger spans (as mentioned in the 
> rebuttal here [5]), the notion of morphology is actually irrelevant to a 
> comprehensive study of (language) data. Wouldn't you agree? 
> Regarding your plane and bird analogy: you could ask whether, if you do 
> insist on cherry-picking from data, your analyses shouldn't still matter. 
> Well, if they don't generalize well, they may end up mattering to you only. 
> 
> Re "... (or issued from a colonialist point of view of Aves on the task at 
> hand…) and asking them to renounce this oh so obsolete bad habit": 
> I suppose it depends on which side of history one would like to be on too. 
> 
> I understand that it can be much harder for those who have lived in a country 
> where "language" activities (and/or the concept of "language") have been 
> officially and explicitly supported/promoted. This "privilege" now puts many 
> of us in a rather disadvantageous position in unlearning much. 
> 
> Re "ML based language models": 
> I don't know what you understand of these, but the logic behind such (e.g. a 
> probabilistic processing/interpretation of sequences) is often not far from 
> how "humans" are known to "process language(s)" --- which is why many 
> modeling experiments can bridge "both spheres" (though I believe many 
> experienced in modeling would buy less into this "human 'versus' machine" 
> narrative). 
> 
> @Gilles: I am also curious what your takeaway is from Quine's "Word and 
> Object" (e.g. at https://mitpress.mit.edu/9780262670012/word-and-object/) in 
> relation to our conversation here. 
> 
> @Anil: the computational phenomenology is already in "Fairness in 
> Representation" (note that the insights were obtained from a collection of 
> many models, i.e. most of them are epi-phenomena). So I think what I have in 
> mind is orthogonal to what you described. Crimes and other misconduct have 
> also been around for millennia; are these things we want to keep?
> That having been clarified, do you have other objections to my contributions?
> 
> I hope I have addressed your concerns sufficiently. If not, please let me 
> know. 
> 
> Thanks and best
> Ada
> 
> 
> [1] The results that ended up getting published in Fairness in 
> Representation <https://openreview.net/forum?id=-llS6TiOew> (ICLR 2022) had 
> been rejected about 5 times, those in "Statistical (Un-)typology" (even with 
> "greedy" research incentives so as to fit in) about another 5 times from May 
> 2019 to April 2022, in addition to other attempts/withdrawals. Since then, 
> all I have been dealing with is retaliation. In fact, I just got some stuff 
> stolen and had to get things reported to the police, so please pardon my 
> delay in reply. 
> [2] At a point, I thought perhaps it'd be best to have no disciplines. Then I 
> realized not all disciplines are like "language", "linguistics", or 
> "structural linguistics". 
> That having been expressed, can having "no disciplines" be still a good 
> thing? Possibly, but another debate, another time, perhaps. 
> [3] But let's bear in mind: what one'd consider a "regular verb" (vs 
> "irregular verb") is nothing but some sequence/utterance seen/heard more 
> frequently than others. 
> [4] esp. in the history of "transformational grammar" that was popular around 
> the mid 20th century. "Grammar rules" might have been around for longer, but 
> branding things as within the domain of "morphology" as a module of a bigger 
> "structure"/"structural framework" of "linguistic analysis" is a matter that 
> has become more popular only in the past half a century or so due to 
> "transformational grammar" / "structural linguistics". 
> But please do note that even in "structural linguistics", many patterns are 
> explained away in terms of (the ranking of) constraints (i.e. no 
> "transformation"). There are no/few reasons to posit the notion of "deep 
> structure(s)", from/through which, in the case of morphological analyses, 
> "stems"/"roots" often get to be held as the bases of inflection. That is, 
> aside from "grammar rules" taught in e.g. schools and those inside 
> researchers' minds, evidence for the existence of "rules" is actually rather 
> little, if any. [N.B. this can be considered advanced for those without 
> a theoretical background in Linguistics.]
> [5] https://openreview.net/forum?id=-llS6TiOew
> 
> 
> 
> On Thu, Oct 26, 2023 at 6:05 PM Anil Singh <[email protected]> wrote:
> I have also been carefully reading the exchanges. Although I was planning not 
> to add to this exchange, at this point I am tempted to reply.
> 
> Ada's early emails were adding something to the discussion and debate, but at 
> this point they are simply saying 'I am right, you are wrong', without giving 
> any explanation or evidence. 
> 
> I was also thinking of the same kind of examples as given by Gilles. Until Ada 
> provides some very good reasoning and evidence, it is hard for me to 
> completely agree with her, although, as I said earlier, I do agree with her on 
> many, perhaps most, things. 
> 
> Ada, I sincerely respect your learning and competence. However, you said 
> earlier you are proposing an alternative computational phenomenology. That 
> would be really interesting. Wouldn't it be better to first propose it and 
> argue in more specific terms, and with more convincing arguments and evidence, 
> that it is the right one, or at least 'more right' than the existing ones 
> (there is more than one)? Given that there is already Information Theory, it has to 
> go beyond byte, which is an accidental unit of computation, and character, 
> which is also not well-defined, sometimes even for one specific writing 
> system. To give one such example, perhaps not the best one, I always thought 
> of an Indic-script dependent vowel sign (maatraa) as a character, but I 
> recently found that languages like Java and Python do not treat such written 
> symbols as characters, so when I try to get the length of an Indic-script 
> string, the built-in string length functions give only the number of consonant 
> symbols and independent vowels in the string. We got wrong results using these 
> functions and I only accidentally discovered that this is the case. The 
> reason, of course, is that these functions and programming languages treat 
> such dependent vowels as diacritics, which is also correct in some ways. I 
> did not realize this earlier because in India we often use a Latin 
> script-based notation called WX for Indic scripts in NLP due to the encoding 
> and input method related problems that I referred to in one of my earlier 
> replies. The WX notation, however, does not distinguish between dependent and 
> independent vowels and treats both of them as the same character, which is 
> how most of us, if not all, think of them in India to the best of my 
> knowledge. On the other hand, the consonant symbol modifier 'halant' is not 
> used in WX, but is used in Indic-scripts and its presence might also cause 
> disagreements about what the string length is. In other words, character as a 
> unit does not work in your terms. In fact, who knows how many errors for 
> Indic script text have made their way into computational results due to this 
> simple fact. And perhaps they still do because it took me a long time to 
> realize this, which at first led to consternation, because in text processing 
> if you can't rely on the string length function, what can you rely on?
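To make the string-length point above concrete, here is a minimal sketch in Python 3 (an illustration added here, not code from the thread). The example word and the helper name `length_views` are our own assumptions. As far as we know, Python 3's `len()` on a `str` counts Unicode code points, so dependent vowel signs and the halant each add one; a count of consonants and independent vowels only, as described in the message, corresponds to skipping characters whose Unicode general category is a mark. The sketch shows how far apart these views can be for one short word.

```python
import unicodedata

def length_views(text):
    """Return several competing notions of the 'length' of a string."""
    code_points = len(text)  # Python 3 str length counts code points
    # General categories starting with "M" are marks (Mn, Mc, Me);
    # Devanagari dependent vowel signs and the halant fall in this group.
    marks = sum(1 for ch in text if unicodedata.category(ch).startswith("M"))
    return {
        "code_points": code_points,
        "marks": marks,
        "base_chars": code_points - marks,
        "utf8_bytes": len(text.encode("utf-8")),
    }

# Hypothetical example word: "Hindi" in Devanagari, written with escapes
# so the code points are explicit: HA, vowel sign I, NA, halant, DA,
# vowel sign II.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

print(length_views(word))
```

For this word the views give 6 code points, 3 marks, 3 base characters and 18 UTF-8 bytes. None of these is a count of user-perceived characters; that would need grapheme-cluster segmentation, which the standard library does not provide directly.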
> 
> As for phonemes, major ML researchers like Vincent Ng don't believe the 
> phoneme to be a real unit of language. The argument is that we don't need 
> phonemes for applications like speech recognition. 
> 
> If not byte and character, what are we left with in terms of computational 
> phenomenology? At the very least there has to be such a well-argued and 
> well-evidenced alternative in order to try to persuade others to agree to 
> your views. I would be very much interested in thinking about such an 
> alternative even if at present I don't think you are right about all your 
> views. After all, if we are to throw away millennia of work on language 
> science, very strong reasoning and evidence for an alternative is not an 
> unrealistic expectation.
> 
> On Thu, Oct 26, 2023 at 8:44 PM Gilles Sérasset via Corpora 
> <[email protected]> wrote:
> Hi Ada,
> 
> When my niece was 3 years old, she said to her little brother “Maman, elle 
> venira plus tard…” (Mum will come back later, in “incorrect” French).
> 
> She made a “mistake” here by using “venira” (a wrong future form of the verb 
> venir (to come)) instead of the “correct” “viendra”. It was wrong, but 
> perfectly predictable using the most productive morphological rules of French 
> future formation.
> 
> She was 3 years old, so I doubt she really understood what morphology 
> is. Nevertheless, with this mistake, she clearly showed me that her way of 
> learning languages did not consist in reading/listening to huge amounts of 
> utterances but that she was able to learn some word formation rules from very 
> few examples. And indeed, humans are still able to perfectly learn complex 
> things with very little explanation and/or very few examples (something that 
> is totally beyond ML based language models).
> 
> In my humble opinion, this proves that morphology exists, if not in the LLM 
> matrixes, at least in the human brain. Hence modelling such rules (and even 
> using them to analyse or produce) is a valid approach, independently of any 
> other (also valid) approaches.
> 
> To put it another way: 
> 
> There have been many scientific proofs that humans would not be able to fly… 
> And these proofs were valid under their own hypotheses.
> 
> Indeed, planes do not flap their wings… they are using other ways to perform 
> a task that was performed by birds.
> 
> Nevertheless, I have never witnessed any plane (or pilot) trying to 
> convince birds that their way of flying is obsolete (or issued from a 
> colonialist point of view of Aves on the task at hand…) and asking them to 
> renounce this oh so obsolete bad habit.
> 
> Regards,
> 
> Gilles,
> 
> _______________________________________________
> Corpora mailing list -- [email protected]
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to [email protected]
> 
> 
> -- 
> - Anil
