Dear Johannes

Re "CreoleVal": at this point, it's more a matter of "one shouldn't" than of whether "one can".
The following is what I wrote to the SIGTYP, I think the message would be similar for your initiative:

"""
---------- Forwarded message ---------
From: Ada Wan <[email protected]>
Date: Tue, Oct 31, 2023 at 6:47 PM
Subject: Re: [Corpora-List] First CFP: The 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP 2024)
To: Michael Hahn <[email protected]>, <[email protected]>
Cc: <[email protected]>

Dear Michael, dear SIGTYP officers and workshop organizers

I saw this posting of yours and have some concerns re the orientation of this workshop/event. Given the work by Mielke et al. (2019) and Wan (2022), I am surprised to see how the workshop description seems not to have been updated accordingly.

I have some questions:
i. would/could such event/initiatives contribute to misinforming academics, professionals, and practitioners (including those who may be new to the topic)?
ii. at what granularity (e.g. "word", character, or byte) will "linguistic typology" be promoted through this workshop/event?
iii. what is/are the "discipline-specific narrative(s)" (default expectations of a discipline), if any, that is/are supposed to hold still, esp. after the 2 publications mentioned above?
iv. how is "language" defined for the aim(s)/purpose(s) of your workshop? and
v. since the initiatives of the workshop are computing-related, is character encoding (an area that has been severely overlooked in the past in Computational Linguistics / Natural Language Processing) being used/promoted/introduced?

One major ethical consideration in the area of "linguistic typology" is that it could unnecessarily exacerbate differences between language varieties, esp. if/when such differences are not observable unless one creates them through "word" (or "word"-like) tokenization in the preprocessing step.
It would be a violation of scientific integrity if one were to continue "word"-hacking (in another formulation: intentionally discarding data) in the name of "linguistic typology", would you not agree?

I look forward to your replies.

Thanks and best
Ada
"""

Thanks and best
Ada

On Tue, Oct 31, 2023 at 8:59 PM Johannes Bjerva via Corpora <[email protected]> wrote:

> We are proud to announce the release of CreoleVal - a collection of
> benchmarks for 28 Creole languages. The collection of datasets span tasks
> such as relation classification, machine comprehension, machine
> translation, named entity recognition, and use cases such as language
> modeling. We cover Haitian Creole, Bislama, Chavacano, Pitkern, Singlish,
> Tok Pisin, Papiamento, and others.
>
> We hope the NLP community will include this collection of datasets in
> ongoing & future evaluations of methods directed at low-resource languages.
> Not only that, we also hypothesise that CreoleVal will open the door for
> controlled experimentation with transfer learning methodology.
>
> This resource has been long in the making, and was made possible by a long
> list of collaborators.
>
> For a pre-print, see: https://arxiv.org/abs/2310.19567
>
> For code and data, see: https://github.com/hclent/CreoleVal
> (Repository under construction)
>
> _______________________________________________
> Corpora mailing list -- [email protected]
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to [email protected]
