[Corpora-List] Re: Corpora of English novels

Martin Wynne via Corpora Mon, 02 Dec 2024 04:39:06 -0800

To summarize the replies to my query for corpora of English novels, the following were suggested (in addition to ELTeC, mentioned in the initial query - European Literary Text Collection, https://www.distant-reading.net/eltec/). Due to some difficulties in locating and accessing the corpora listed below under the first two items, the corpus creators have agreed to deposit them in the Oxford Text Archive collections - old and new links are given below:

1.

Corpus of Late Modern English Texts (there is a version called CLMET 3.1 as well as the extended version) and the CEN (Corpus of English Novels) compiled by Hendrik de Smet and others:


CLMET now available here: http://hdl.handle.net/20.500.14106/2574
CEN now available here: http://hdl.handle.net/20.500.14106/2573

2.
Corpus of the Canon of Western Literature

https://www.researchgate.net/publication/321773386_Introducing_the_Corpus_of_the_Canon_of_Western_Literature_A_corpus_for_culturomics_and_stylistics (downloadable in full from https://www.researchgate.net/publication/385291433_CCWL_10_Jan_2018rar ).


CCWL now also available here: http://hdl.handle.net/20.500.14106/2575

3.

Amalgum: a richly annotated corpus including about 0.5M tokens of samples from Project Gutenberg novels, alongside data from 7 other genres:


https://github.com/gucorpling/amalgum/blob/dev/amalgum/fiction/dep/AMALGUM_fiction_adams.conllu

The data is automatically annotated with good quality neural UD parses, coreference resolution, entity recognition, discourse parses and more, with excerpts from over 400 novels included. We also have a much smaller but manually annotated corpus which includes fiction, along with other genres in our GUM/GENTLE corpora (24 genres total): https://gucorpling.org/gum/
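
For anyone who wants a quick look inside a file such as the .conllu one linked above, the following is a minimal sketch of reading CoNLL-U data in plain Python. It assumes the standard CoNLL-U layout of ten tab-separated columns per token line; the function name read_conllu and the one-sentence sample are illustrative only and are not taken from AMALGUM or its tooling.

```python
from collections import Counter


def read_conllu(lines):
    """Yield each sentence as a list of (form, lemma, upos) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment/metadata line.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed token line; skip it.
            continue
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:
        yield sentence


# Hypothetical sample sentence, for illustration only.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tread\tread\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tnovels\tnovel\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
upos_counts = Counter(upos for sent in sentences for _, _, upos in sent)
print(len(sentences), dict(upos_counts))
```

To use this on a downloaded file, pass an open file handle (opened with encoding="utf-8") to read_conllu in place of the sample lines.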

Many thanks to Bea Alex, Sabine Bartsch, Hendrik de Smet, Clarence Green and Amir Zeldes.

I note that this hasn't revealed a great number of available corpora! I suspect that there are a large number of ad hoc datasets that people make for specific projects, sampling from the large number of text collections available, but I still don't find many open access reference corpora for particular genres and time periods.

The CLARIN Resource Family for Literary Corpora (https://www.clarin.eu/resource-families/literary-corpora) has a number in other languages, and I'll get this updated with the above corpora and anything else that I find for English.


Best wishes,
Martin Wynne


On 27/10/2024 12:11, Martin Wynne via Corpora wrote:

I have a student who is interested in tracing the development of the English novel from its origins to the present day (or at least to the start of the twentieth century), and I'm trying to gather information about relevant corpora covering this text type and period.
We know about the European Literary Text Collection (ELTeC, https://www.distant-reading.net/eltec/) which will be very useful for the later end of the timescale. We also know it is possible to assemble a corpus from Project Gutenberg, archive.org, Oxford Text Archive, etc., but would be interested in re-using any corpora that people might already have made, which aim to be representative of particular periods within this genre.
The student has some flexibility with her research question, so while the original idea of 'English novels' was probably 'novels in English from Great Britain and Ireland', other related areas such as US novels might be interesting as well.
Any tips and suggestions gratefully received. If we get a number of interesting direct emails, I'll be happy to summarize the results to the list.
Best wishes,
Martin


--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
[email protected]
https://orcid.org/0000-0002-4155-0530

