[Corpora-List] Re: RANLP 2023 Call for Participation

Gilles Sérasset via Corpora Wed, 30 Aug 2023 07:35:57 -0700

Dear Ada, dear all,

I am not a linguist but a computational scientist who is quite used to talking
with (and trying to understand) linguists. I must say that I usually read your
mails as thoroughly as my schedule and patience allow me to, but, to be
honest, I also have a rather negative feeling when reading your “discourse”.


In this discourse, I see facts + interpretation + rhetoric.

[Here I take the risk of caricaturing for the sake of brevity. I hope you
will understand that I have neither the time nor the intention to go deeply
into all the intricacies of your different claims, as I am more a witness than
an actor in this scientific dispute.]

My understanding of your facts: Neural models do not use the concept of word in 
any of their tasks, but achieve very interesting results in their modelling of 
the language.

My understanding of your interpretation: this is the proof that there is no 
such thing as a word.

My understanding of your rhetoric: linguists are still using “words”, so they 
are wrong or dishonest or miseducated or dumb, we should wipe out entirely any 
occurrence of this concept and start over with another modelling of the language.

Please, understand that I am just presenting the way I am interpreting your 
different messages. And even if I am wrong here, this interpretation is to be 
taken into account as we are all people with feelings. This feeling is a fact,
even if I do not particularly feel targeted by your different criticisms. I 
hope this will help you ponder the terms involved in your next messages.

This being said, I was not particularly surprised to see some “passionate”
replies to your different messages. And I agree with everyone here: we should
not give in to such passion or use ad hominem attacks on a mailing list, AND
you should also understand that most of your rhetoric does contain such passion
and attacks.




Concerning the facts:

You are right: neural models do not use any notion of word (or word
morphology) as it is usually thought of in linguistics. Instead, a model
usually first decides the granularity at which it will aggregate its input (a
sequence of characters) into tokens, to which it attaches an “interpretation”
(modelled as a multi-dimensional vector).
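
As an illustration of the pipeline just described, here is a minimal sketch in
Python. The vocabulary, the greedy longest-match rule and the random
4-dimensional vectors are all assumptions made for the example; real systems
learn both the segmentation and the vectors from data.

```python
import random

# Hypothetical subword vocabulary (an assumption for illustration only).
VOCAB = ["un", "break", "able", "s", " "]

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match segmentation of a character sequence.

    Characters not covered by the vocabulary become single-character tokens.
    """
    pieces = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        match = next((p for p in pieces if text.startswith(p, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens

def embed(tokens, dim=4, seed=0):
    """Attach a vector to each distinct token (random here, learned in practice)."""
    rng = random.Random(seed)
    table = {}
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1, 1) for _ in range(dim)]
    return [table[tok] for tok in tokens]

tokens = tokenize("unbreakable spaces")
vectors = embed(tokens)
print(tokens)
```

Nothing in this sketch corresponds to a “word”: the space is just one more
token, and the boundaries fall wherever the vocabulary happens to put them.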




Concerning the interpretation:

1. You want to wipe out the notion of word based on such a fact. I would agree 
somehow if we were dealing with a universal modelling of language, but this is 
not the case. Humans model language in a certain way and neural models in
another way (even if neural networks are claimed to be inspired by biological 
neurones in our brains). The fact that a concept does not exist in a model does 
not entail that it does not exist in another model.


2. Also, you make the very same mistake in the way you look at the facts:
there is no such thing as a character either, which means that the input of a
NN is already flawed by a bias in the way we look at language. Indeed,
characters are a very recent invention that builds on different concerns:
 - usual graphical elements that are traditionally used in language writing and
that have been interpreted as atomic,
 - their interpretation by the encoding authorities (see the differences and
debates about code points vs characters),
 - arbitrary decisions (e.g. why model A and a as 2 different characters?).
Moreover, corpora are usually badly encoded, using one character for another
(quote instead of apostrophe, non-breaking space instead of a space, …), and
all this only applies to languages with a writing system or transcription,
i.e. not the majority of them.
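
These encoding issues are easy to observe. The sketch below uses only Python's
standard `unicodedata` module; the helper name `describe` and the sample
strings are assumptions chosen for the example.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# The same "é" on screen can be one code point or two.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(len(composed), len(decomposed))
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# Look-alikes that are commonly used one for another in corpora.
print(describe("'\u2019"))  # apostrophe vs right single quotation mark
print(describe(" \u00a0"))  # space vs no-break space

# "A" and "a" are distinct code points, related only by a case mapping.
print(describe("Aa"), "A".lower() == "a")
```

A model fed the raw code points sees the composed and decomposed forms as
different inputs unless someone decides to normalize them first, which is
itself a modelling choice.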

The conclusion is that even neural networks carry an artificial bias in the way
they model language, which means that the conclusions we draw from them are as
flawed as the ones we draw from the classical way linguists look at languages.


3. Most serious linguists never defined “words” lightly and most of them know
that this concept is an “approximation” of something that is very difficult to
apprehend and seems to be grounded more in linguistics from human
introspection than in linguistics from corpora. It somehow represents the way
our human brain aggregates the atoms of the language (characters/phonemes) into
something to which we associate an interpretation. In this sense, words are
somehow the “tokens” of our biological neural network (and certainly far more).

As an utterance production is not a bijection between whatever we have in our 
head and the sequential signal we use to communicate, I agree with you on the 
fact that “words” are certainly not present in a corpus (but I do think that
our inner “tokens” may be observed somehow there).


Concerning the rhetoric:

I do not think any linguist or computational linguist is naive enough to think
that any of the models we deal with is a “truth” and I doubt any of them is
miseducated enough to think that “words” are clearly defined and undoubtedly
present in corpora. I do think though that they are usually right to observe
occurrences (or hints) of non-atomic constructs we associate with some
interpretation. I also think that this way of looking at a corpus has some
advantages that are not really present in NN (for instance, it can reveal
regularities that help humans produce new utterances without being shown a
large amount of examples).

I also think that, even if you were totally right in your facts and
interpretations, asking for a denial of current/past ways of looking at
texts would be a mistake. Even in physics, since the general theory of
relativity, we know that classical mechanics is wrong; however, it is still in
use and this is not a problem as long as everybody knows under which hypotheses
it is a good enough approximation and under which hypotheses it no longer
works.
  


I know this message will certainly not make you think differently, but if it 
allows you to communicate differently with people who still use the terms
“words” or “sentences” as a simple shortcut to position their work into a
shared/common understanding of the state of the art, in contexts where there is 
no room for better explanation (e.g. in summaries of their keynote speech), 
then I will have achieved something.

Hoping this scientific debate will continue in a calmer manner,

Regards, 

Gilles Sérasset,

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
