The next meeting of the Edge Hill Corpus Research Group will take place online 
(via MS Teams) on Thursday 25 April 2024, 2:00-3:30 pm (UK time).

Attendance is free. You can register here:
https://store.edgehill.ac.uk/conferences-and-events/conferences/events/edge-hill-corpus-research-group-thursday-25th-april-2024

Topics: Corpus Methodology, Large Language Models

Speakers: Sylvia Jaworska 
<https://www.reading.ac.uk/elal/staff/dr-sylvia-jaworska> (University of 
Reading, UK) & Mathew Gillings 
<https://www.wu.ac.at/ebc/about-us/team/mathew-gillings/> (Vienna University 
of Economics and Business, Austria)

Title: How humans vs. machines identify discourse topics: an exploratory 
triangulation

Abstract

Identifying discourses and discursive topics in a set of texts has been of 
interest not only to linguists but also to researchers working across the 
social sciences. Traditionally, such work has relied on small-scale 
interpretive analyses of discourse that involve some form of close reading. 
However, close reading is only feasible when the dataset is small, and it 
leaves the analyst open to accusations of bias and/or cherry-picking.

Designed to avoid these issues, other methods have emerged which involve larger 
datasets and have some form of quantitative component. Within linguistics, this 
has typically meant the use of corpus-assisted methods, whilst outside 
linguistics, topic modelling is one of the most widely used approaches. 
Increasingly, researchers are also exploring the utility of LLMs (such as 
ChatGPT) to assist in the analysis and identification of topics. This talk 
reports on a study assessing the effect that analytical method has on the 
interpretation of texts, specifically in relation to the identification of the 
main topics. Using a corpus of corporate sustainability reports, totalling 
98,277 words, we asked six researchers, along with ChatGPT, to interrogate the 
corpus and decide on its main ‘topics’ via four different methods. The amount 
of context available increases gradually from one method to the next.

•       Method A: ChatGPT was asked to categorise the topic model output and 
assign topic labels;
•       Method B: Two researchers were asked to view a topic model output and 
assign topic labels based purely on eyeballing the co-occurring words;
•       Method C: Two researchers were asked to assign topic labels based on a 
concordance analysis of 100 randomised lines for each co-occurring word;
•       Method D: Two researchers were asked to reverse-engineer a topic model 
output by creating topic labels based on a close reading.
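
As a rough illustration of the sampling step in Method C, the sketch below 
draws up to 100 random concordance (keyword-in-context) lines for one word. 
It is not the authors' procedure, which the abstract does not specify: the 
function name kwic_sample, the tokenisation, the five-word window and the 
fixed random seed are all assumptions made for this example.

```python
import random
import re


def kwic_sample(text, node, n=100, window=5, seed=0):
    """Return up to n randomly sampled concordance lines for `node`.

    Hypothetical helper for illustration only; tokenisation is a simple
    lowercase word split, which a real corpus tool would do more carefully.
    """
    tokens = re.findall(r"\w+", text.lower())
    # Positions of every occurrence of the node word.
    hits = [i for i, tok in enumerate(tokens) if tok == node.lower()]
    # Seeded generator so the same sample can be drawn again.
    rng = random.Random(seed)
    chosen = sorted(rng.sample(hits, min(n, len(hits))))
    lines = []
    for i in chosen:
        left = " ".join(tokens[max(0, i - window):i])
        right = " ".join(tokens[i + 1:i + 1 + window])
        # Right-align the left context so the node words line up.
        lines.append(f"{left:>40} [{tokens[i]}] {right}")
    return lines


if __name__ == "__main__":
    sample_text = (
        "Our emissions fell this year. Emissions targets were met. "
        "We report emissions annually to stakeholders."
    )
    for line in kwic_sample(sample_text, "emissions", n=2):
        print(line)
```

If the word occurs fewer than n times, every occurrence is returned, so the 
sample size is a ceiling rather than a guarantee.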

The talk explores how the identified topics differed both between researchers 
in the same condition and between researchers in different conditions, shedding 
light on some of the mechanisms underlying topic identification by machines vs. 
humans or machines assisted by humans. We conclude with a series of tentative 
observations regarding the benefits and limitations of each method, along with 
suggestions for researchers selecting an analytical approach for discourse 
topic identification. While this study is exploratory and limited in scope, it 
opens the way for further methodological and larger-scale triangulations of 
corpus-based analyses with other computational methods, including AI.

If you have any questions, please contact the organiser, Costas Gabrielatos 
([email protected]).
