The next meeting of the Edge Hill Corpus Research Group will take place online 
(via MS Teams) on Thursday 29 February 2024, 2:00-3:30 pm (GMT).

Attendance is free. You can register here:
https://store.edgehill.ac.uk/conferences-and-events/conferences/events/edge-hill-corpus-research-group-thursday-29th-february-2024

Registration closes on Wednesday 28 February at 11 am (GMT).

Topic: Corpus Methodology

Speaker: Matteo Di Cristofaro (University of Modena and Reggio Emilia, Italy), 
https://infogrep.it/site/

Title: One dataset, many corpora: Problems of scientific validity in corpora 
and corpus-derived results

   Abstract

Corpus linguistics has, since its inception, recognised the relevance of 
digital technologies as a major driving force behind corpus techniques and 
their (r)evolution in the study of language (cf. Tognini-Bonelli 2012). And 
yet, while both corpus linguistics and digital technologies have frequently 
benefited from each other (the case of NLP/NLU is one such macro example), 
their pathways have often diverged. The result is a disconnect between corpus 
linguistics and digital data processing whose effects directly impinge on the 
ability to analyse language through software tools. This disconnect is becoming 
ever more relevant as corpus linguistics is applied to vast amounts of 
data obtained from manifold sources – including a wide array of social media 
platforms, each with its own linguistic and technical peculiarities.

As the ground truth of an ever-increasing number of language studies, corpora 
must be able to correctly treat and represent such peculiarities: e.g. the 
dialogic dimension of comments or forum posts; the presence (and potential 
subsequent normalisation) of spelling variations; the use of hashtags and 
emojis. If they fail to do so, corpus-derived results will likely present 
researchers with a falsified view of the language under scrutiny.
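
As a small illustration of the point about hashtags and emojis (a sketch 
written for this announcement, not taken from the talk or the cited book), 
the Python snippet below tokenises an invented social media post in two 
ways. The example post and both regular expressions are assumptions chosen 
for brevity; multi-code-point emojis, for instance, would need more careful 
handling than shown here.

```python
import re
from collections import Counter

# Invented example post (an assumption, not real data).
post = "Loving the #CorpusLinguistics talk 😀 #corpuslinguistics rocks"
text = post.lower()

# Naive tokeniser: keeps only runs of word characters, so the '#' of a
# hashtag and the emoji are silently dropped from the corpus.
naive_tokens = re.findall(r"\w+", text)

# Hashtag-aware tokeniser: keeps '#tag' as a single token and keeps any
# other non-space, non-word symbol (such as a single-code-point emoji).
aware_tokens = re.findall(r"#\w+|\w+|[^\w\s]", text)

naive_counts = Counter(naive_tokens)
aware_counts = Counter(aware_tokens)

print(naive_tokens)
print(aware_tokens)
```

With the naive pattern, the hashtag is counted as the ordinary word 
"corpuslinguistics" and the emoji never reaches the frequency list, so the 
counts describe features that differ from those in the original post.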

What is at stake is not the ability to “count” what is in a corpus, but rather 
whether what is being counted is or is not a feature present in the original 
data – of which the corpus should be a faithful representation.

The presentation is consequently devoted to tackling digital technicalities, 
i.e. “those notions and mechanisms that – while not classically associated with 
natural language – are i) foundational of the digital environments in which 
language production and exchanges occur and ii) at the core of the techniques 
that are used to produce, collect, and process the focus of investigation, that 
is, digital textual data.” (Di Cristofaro 2023:5). Character encodings are one 
such example: although at the “core” of the whole corpus linguistics 
enterprise (cf. McEnery and Xiao 2005; Gries 2016:39,111) – since they allow 
written language to be processed by a computer and understood by humans – 
they are often overlooked at all stages of corpus compilation and analysis, 
potentially leading linguists to tamper involuntarily with the data and its 
linguistic contents.
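
To make the encoding issue concrete, here is a minimal Python sketch (an 
illustration written for this announcement, not one of the speaker's 
scripts). It shows two ways in which encodings can alter what ends up being 
counted: decoding UTF-8 bytes with the wrong codec, and mixing composed and 
decomposed Unicode forms of the same word. The example words are 
assumptions chosen for illustration.

```python
import unicodedata
from collections import Counter

# 1) Wrong codec: UTF-8 bytes read as Latin-1 produce a different string.
word = "città"
misread = word.encode("utf-8").decode("latin-1")
print(repr(word), len(word))        # 5 characters
print(repr(misread), len(misread))  # 6 characters, no longer the same word

# 2) Same visible word, two different code point sequences.
composed = "caf\u00e9"      # 'é' as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent
raw_tokens = [composed, decomposed]
raw_types = Counter(raw_tokens)

# Normalising to NFC before counting merges the two variants.
nfc_tokens = [unicodedata.normalize("NFC", t) for t in raw_tokens]
nfc_types = Counter(nfc_tokens)

print(len(raw_types))  # 2 types before normalisation
print(len(nfc_types))  # 1 type after normalisation
```

Whether normalisation is appropriate depends on the research question; the 
point is only that the choice changes the resulting frequencies and should 
therefore be made deliberately.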

Starting from practical examples, the presentation discusses the implications 
that digital technicalities have on corpora and their analyses – or rather, 
what happens when they are not properly treated – while outlining (also in the 
form of Python scripts and practical tools) potential new pathways that a 
“digital-aware” perspective of corpus linguistics can open up.

    References

    Di Cristofaro, Matteo. Corpus Approaches to Language in Social Media. 
Routledge Advances in Corpus Linguistics. New York: Routledge, 2023. 
https://doi.org/10.4324/9781003225218.
    Gries, Stefan Th. Quantitative Corpus Linguistics with R: A Practical 
Introduction. 2nd ed. New York: Routledge, 2016. 
https://doi.org/10.4324/9781315746210.
    McEnery, Tony, and Richard Xiao. ‘Character Encoding in Corpus 
Construction’. In Developing Linguistic Corpora: A Guide to Good Practice, 
edited by Martin Wynne, 47–58. Oxford: Oxbow Books, 2005. 
https://users.ox.ac.uk/~martinw/dlc/index.htm.
    Tognini-Bonelli, Elena. ‘Theoretical Overview of the Evolution of Corpus 
Linguistics’. In The Routledge Handbook of Corpus Linguistics, edited by Anne 
O’Keeffe and Michael McCarthy, 14–27. Routledge Handbooks in Applied 
Linguistics. Milton Park, Abingdon, Oxon; New York: Routledge, 2012.
