[Corpora-List] Edge Hill Corpus Research Group, Thursday 29 February 2024, 2:00-3:30 pm (GMT)

Costas Gabrielatos via Corpora Sat, 10 Feb 2024 03:30:15 -0800

The next meeting of the Edge Hill Corpus Research Group will take place online 
(via MS Teams) on Thursday 29 February 2024, 2:00-3:30 pm (GMT).

Attendance is free. You can register here:
https://store.edgehill.ac.uk/conferences-and-events/conferences/events/edge-hill-corpus-research-group-thursday-29th-february-2024

Registration closes on Wednesday 28 February, 11 am (GMT)

Topic: Corpus Methodology

Speaker: Matteo Di Cristofaro (https://infogrep.it/site/), University of Modena
and Reggio Emilia, Italy

Title: One dataset, many corpora: Problems of scientific validity in corpora
and corpus-derived results

Abstract

Corpus linguistics has, since its inception, recognised the relevance of
digital technologies as a major driving force behind corpus techniques and
their (r)evolution in the study of language (cf. Tognini-Bonelli 2012). And
yet, while both corpus linguistics and digital technologies have frequently
benefited from each other (the case of NLP/NLU is one such macro example),
their pathways have often diverged. The result is a disconnect between corpus
linguistics and digital data processing whose effects directly impinge on the
ability to analyse language through software tools. This disconnect is becoming
more and more relevant as corpus linguistics is applied to vast amounts of
data obtained from manifold sources – including a wide array of social media
platforms, each one with its unique linguistic and technical peculiarities.

As the ground-truth of an ever-increasing number of language studies, corpora
must be able to correctly treat and represent such peculiarities: e.g. the
dialogic dimension of comments or forum posts; the presence (and potential
subsequent normalisation) of spelling variations; the use of hashtags and
emojis. If they fail to do so, the corpus-derived results will likely present
researchers with a falsified view of the language under scrutiny.

What is at stake is not the ability to “count” what is in a corpus, but rather
whether what is being counted is or is not a feature present in the original
data – of which the corpus should be a faithful representation.
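
The point about counting can be illustrated with a minimal Python sketch. It is
not taken from the presentation: the sample post, the two tokenisers and their
regular expressions are all assumptions made up for illustration. A tokeniser
that keeps only word characters silently removes the "#" of hashtags and drops
emojis altogether, so the resulting counts describe features that are not in
the original data.

```python
import re
from collections import Counter

# Hypothetical social media post, invented for this example.
post = "Loving the #CorpusLinguistics workshop 😀 #corpuslinguistics rocks"


def naive_tokens(text):
    # Word characters only: the "#" is stripped and the emoji disappears.
    # Lowercasing is itself a normalisation choice that merges case variants.
    return re.findall(r"\w+", text.lower())


def social_tokens(text):
    # Keeps hashtags as single tokens, and keeps every other non-space,
    # non-word symbol (such as an emoji) as a token of its own.
    return re.findall(r"#\w+|\w+|[^\w\s]", text.lower())


naive_counts = Counter(naive_tokens(post))
social_counts = Counter(social_tokens(post))

# The naive count reports a plain word that never occurs in the post,
# while the hashtag and the emoji are missing from it.
print(naive_counts["corpuslinguistics"], naive_counts["#corpuslinguistics"])
print(social_counts["corpuslinguistics"], social_counts["#corpuslinguistics"])
print(naive_counts["😀"], social_counts["😀"])
```

Neither tokeniser is "correct" in general; the sketch only shows that the
choice decides which features of the original data survive into the corpus.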

The presentation is consequently devoted to tackling digital technicalities,
i.e. “those notions and mechanisms that – while not classically associated with
natural language – are i) foundational of the digital environments in which
language production and exchanges occur and ii) at the core of the techniques
that are used to produce, collect, and process the focus of investigation, that
is, digital textual data.” (Di Cristofaro 2023:5). One such example is
character encodings. They sit at the “core” of the whole corpus linguistics
enterprise (cf. McEnery and Xiao 2005; Gries 2016:39,111), since they allow
written language to be processed by a computer and understood by humans, yet
they are often overlooked at all stages of corpus compilation and analysis,
potentially leading linguists to tamper involuntarily with the data and its
linguistic contents.
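
As a rough illustration of how encodings can alter the data, the following
Python sketch (again an assumption for illustration, not the speaker's own
script) shows two common cases: UTF-8 bytes decoded with the wrong codec, and
the same visible word stored as two different code point sequences. The sample
text and variable names are invented.

```python
import unicodedata
from collections import Counter

# Sample text written with escapes so the code points are unambiguous:
# "\u00e9" is the precomposed character e-acute, "\u00ef" is i-diaeresis.
original = "caf\u00e9 na\u00efve caf\u00e9"
raw = original.encode("utf-8")

# Decoding with the wrong codec raises no error but changes the characters.
wrong = raw.decode("latin-1")
right = raw.decode("utf-8")
print(wrong)
print(right == original)

# The same visible word as precomposed (NFC) and decomposed (NFD) forms.
composed = unicodedata.normalize("NFC", "caf\u00e9")
decomposed = unicodedata.normalize("NFD", "caf\u00e9")
print(len(composed), len(decomposed))

# Without normalisation a frequency list treats them as two word types.
counts = Counter([composed, decomposed])
normalised = Counter(
    unicodedata.normalize("NFC", word) for word in [composed, decomposed]
)
print(len(counts), len(normalised))
```

In both cases nothing fails visibly: the text still loads and can be counted,
but what is counted no longer matches the original data.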

Starting from practical examples, the presentation discusses the implications
that digital technicalities have on corpora and their analyses – or rather,
what happens when they are not properly treated – while outlining (also in the
form of Python scripts and practical tools) potential new pathways that a
“digital-aware” perspective of corpus linguistics can open up.

References

Di Cristofaro, Matteo. Corpus Approaches to Language in Social Media.
Routledge Advances in Corpus Linguistics. New York: Routledge, 2023.
https://doi.org/10.4324/9781003225218.
Gries, Stefan Th. Quantitative Corpus Linguistics with R: A Practical
Introduction. 2nd ed. New York: Routledge, 2016.
https://doi.org/10.4324/9781315746210.
McEnery, Tony, and Richard Xiao. ‘Character Encoding in Corpus
Construction’. In Developing Linguistic Corpora: A Guide to Good Practice,
edited by Martin Wynne, 47–58. Oxford: Oxbow Books, 2005.
https://users.ox.ac.uk/~martinw/dlc/index.htm.
Tognini-Bonelli, Elena. ‘Theoretical Overview of the Evolution of Corpus
Linguistics’. In The Routledge Handbook of Corpus Linguistics, edited by Anne
O’Keeffe and Michael McCarthy, 14–27. Routledge Handbooks in Applied
Linguistics. Milton Park, Abingdon, Oxon ; New York: Routledge, 2012.


