[Corpora-List]Re: Complex Word Identification in French

Miloš Jakubíček Wed, 22 Jun 2022 04:42:38 -0700

Hi Ada,

a very good paper (and a lot of work done - congrats!) and a very interesting
thread.


Clearly linguistics as a field is terribly lacking a unified taxonomy
(compared to biology, chemistry, whatever -- the difference is rather
striking) and yes, this is causing serious trouble in NLP and in general
(with linguists spending time promoting "their" definition instead of
promoting harmonization regardless of their own preferences and
traditions). But "word" here is not a special case -- it's just one of the
many linguistic terms that are ill-defined.

When you say:

*It is of the best interest of the community to discontinue the usage of
"word".*

you must realize this is never going to happen (and it just results in
getting an Orwellian hail; quite understandably, you'll not find a lot of
support for prescriptive views on language on this list :).
This is for both practical reasons ("word" is a word everyone is used to
using, it is normal, and being normal is the strongest card you can play in
language) and theoretical ones (the problem is not the string-label,
but the concept itself, so changing the label is of no help here).

So, taking this practically:
- you may try to convince linguists to make some harmonized taxonomy (and I
wish I knew how to be of any help here, but I don't think the majority
sees it as a problem at all - they see it as "complexity/property of
language" - so I don't believe this is going to happen anytime soon)
- you may promote the awareness of how word-processing impacts performance
and evaluation of NLP tools

The latter I think is much more useful and more likely to at least
partially succeed -- and I was really happy to see that your paper
quantifies that in state-of-the-art techniques (you may have a look at a
2016 PhD thesis of one of my ex-colleagues:
https://is.muni.cz/th/en6ay/thesis.pdf who investigated character based
language modelling).
It is an omnipresent problem that actually starts with English and simple
tasks like PoS tagging, in a slightly different way: the impact of not
following exactly the same tokenization the tagger expects/was trained on
(esp. for high frequency items like "don't") is huge.
In Sketch Engine, where we use tons of word-processing tools (mostly
segmenters/stemmers/PoS taggers/lemmatizers), getting the input tokenization
right is often the most difficult job; and of course, more so for languages
where the word is more of an artificial and arbitrary concept.
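
To make the tokenization point concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the function names, the toy vocabulary and the simplified Penn-Treebank-style clitic split are hypothetical, and this is not how Sketch Engine or any particular tagger does it. It only shows how the same sentence can look fully known or mostly unknown to a tagger depending on which tokenization feeds it.

```python
import re

def tokenize_whitespace(text):
    """Naive tokenization: split on whitespace only."""
    return text.split()

def tokenize_ptb_like(text):
    """Simplified Penn-Treebank-style split (assumption, not a full PTB
    tokenizer): "don't" -> "do" + "n't", punctuation separated."""
    return re.findall(r"\w+(?=n't)|n't|\w+|[^\w\s]", text)

def oov_rate(tokens, vocab):
    """Share of tokens that the (hypothetical) tagger has never seen."""
    return sum(t not in vocab for t in tokens) / len(tokens)

# Toy vocabulary of a tagger trained on PTB-style tokens (hypothetical).
tagger_vocab = {"I", "do", "n't", "know", "."}

sentence = "I don't know."
ws_tokens = tokenize_whitespace(sentence)    # ['I', "don't", 'know.']
ptb_tokens = tokenize_ptb_like(sentence)     # ['I', 'do', "n't", 'know', '.']

print(ws_tokens, oov_rate(ws_tokens, tagger_vocab))
print(ptb_tokens, oov_rate(ptb_tokens, tagger_vocab))
```

With the whitespace split, two of the three tokens are unknown to the toy tagger; with the matching split, none are.
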
From a non-academic perspective, the issue is that users expect something
familiar like words, and many are aware of the level of arbitrariness of
"words" in their focus language.

So, reading:

*Well, talk to the NLP crowd or the ones who expect LM/MT results from
different languages should have different performances, even if/when all
else were equal. (I remember how hard and how many rounds I had to work for
my rebuttals....)*

Indeed, this is very much what everyone is used to :-/
From a purely technical perspective, switching to characters (or bytes --
which however are not a good level of abstraction in terms of
interpretation, especially with variable-length encodings like UTF-8)
is sometimes the right thing to do (though the figures get easily skewed --
esp. in a web corpus with plenty of long URLs).
And sometimes the desired level of abstraction is the other way around,
arriving at MWEs being even more of a nightmare than poor "words",
whatever they represent.
And sometimes, the best way is to keep using "words" with lots of policing
around what they are, which btw might very well be Christopher's case with
French.
Which way to go depends on particular use cases, so: is "word" a
well-defined unit? Certainly not. Is it a useful one? Sometimes yes.
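
On the characters-versus-bytes remark above, a small Python sketch may help. The helper names and the toy token lists are assumptions made up for illustration. It shows that character and UTF-8 byte counts diverge for non-ASCII text, and that a single long URL can pull up an average token length considerably.

```python
def char_and_byte_len(token):
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(token), len(token.encode("utf-8"))

def mean_char_len(tokens):
    """Average token length in characters."""
    return sum(len(t) for t in tokens) / len(tokens)

# ASCII text: one byte per character.
print(char_and_byte_len("Brno"))
# "í" and "č" each take two bytes in UTF-8, so the counts differ.
print(char_and_byte_len("Jakubíček"))

# Toy "corpus" without and with one long URL (hypothetical data).
plain = ["a", "very", "good", "paper"]
with_url = plain + ["https://is.muni.cz/th/en6ay/thesis.pdf"]

print(mean_char_len(plain))
print(mean_char_len(with_url))
```

Whether URLs should be filtered, normalized or kept is again a use-case decision; the sketch only shows why the raw figures should be read with care.
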

Cheers and "*I wished the world would give more worth to data*"-too
Milos


Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.eu


On Mon, 20 Jun 2022 at 17:34, Ada Wan <[email protected]> wrote:

> Hi Christopher,
>
> It is of the best interest of the community to discontinue the usage of
> "word". The term is not only very shaky in its foundation (if any), but it
> can also effect disparity in performance in computational processing and
> robustness when human evaluation is involved.
> Despite the term has been casually adopted by many in the past, like many
> un-PC terms that may have an inappropriate undertone, it needs to be
> discouraged and abandoned.
> Last but not least, I noticed that you are located in Canada, in the event
> that you were to work with any indigenous communities, one MUST be advised
> to be careful with the usage of such term --- you could be imposing your
> own (EN- / FR- / dominant language-centric) view onto another
> individual/community. There is an element of cultural and
> linguistic hegemony with the usage of such term (including and not limited
> to making applications with it).
> Please also consult recent work in this area:
> https://openreview.net/forum?id=-llS6TiOew.
>
> Feel free to get in touch if you should have any questions.
>
> Best,
> Ada
>
>
> On Mon, Jun 20, 2022 at 4:53 PM Christopher Collins <
> [email protected]> wrote:
>
>> Hello,
>>
>>
>>
>> I’m looking for any open source or cloud-hosted solution for complex word
>> identification or word difficulty rating in French for a reading
>> application.
>>
>>
>>
>> As a backup plan we can use measures like corpus frequency, length,
>> number of senses, but we’re hoping someone has already made a tool
>> available.
>>
>>
>>
>> We found this but that’s it: https://github.com/sheffieldnlp/cwi
>>
>>
>>
>> Would appreciate any tips!
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Chris
>>
>>
>>
>> *Christopher Collins *[he/him
>> <https://medium.com/gender-inclusivit/why-i-put-pronouns-on-my-email-signature-and-linkedin-profile-and-you-should-too-d3dc942c8743>
>> ]
>> Associate Professor - Faculty of Science
>> Canada Research Chair in Linguistic Information Visualization
>> Ontario Tech University
>> vialab.ca
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>>