Hi Amir,
Many thanks for getting in touch, and for letting me know about this. I
think the student in this case wants full texts, but this data looks
very interesting, particularly with the rich annotation, so it will
definitely be useful for a number of use cases, and I'll add it to my
summary for the list.
Best wishes,
Martin
On 28/10/2024 20:16, [email protected] wrote:
Hi Martin,
I'm not sure if this will help, but if your student is interested in doing
something with richly annotated Gutenberg data, there is a very deeply
annotated corpus including about 0.5M tokens of samples from Project Gutenberg
novels here, next to data from 7 other genres:
https://github.com/gucorpling/amalgum/blob/dev/amalgum/fiction/dep/AMALGUM_fiction_adams.conllu
The data is automatically annotated with good quality neural UD parses,
coreference resolution, entity recognition, discourse parses and more, with
excerpts from over 400 novels included. We also have a much smaller but
manually annotated corpus which includes fiction, along with other genres in
our GUM/GENTLE corpora (24 genres total):
https://gucorpling.org/gum/
Hope these are useful,
Amir
------------
Dr. Amir Zeldes
Assoc. Prof. of Computational Linguistics
Department of Linguistics
Georgetown University
1437 37th St. NW
Washington, DC 20057
https://gucorpling.org/amir
-----Original Message-----
From: Martin Wynne via Corpora <[email protected]>
Sent: Sunday, October 27, 2024 8:11 AM
To: [email protected]
Subject: [Corpora-List] Corpora of English novels
I have a student who is interested in tracing the development of the English
novel from its origins to the present day (or at least to the start of the
twentieth century), and I'm trying to gather information about relevant corpora
covering this text type and period.
We know about the European Literary Text Collection (ELTeC,
https://www.distant-reading.net/eltec/)
which will be very useful for the later end of the timescale. We also know it is possible
to assemble a corpus from Project Gutenberg, archive.org, Oxford Text Archive, etc.,
but would be interested in re-using any corpora that people might already
have made, which aim to be representative of particular periods within this
genre.
The student has some flexibility with her research question, so while the
original idea of 'English novels' was probably 'novels in English from Great
Britain and Ireland', other related areas such as US novels might be
interesting as well.
Any tips and suggestions gratefully received. If we get a number of interesting
direct emails, I'll be happy to summarize the results to the list.
Best wishes,
Martin
--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
[email protected]
https://orcid.org/0000-0002-4155-0530
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]