To summarize the replies to my query for corpora of English novels, the
following were suggested, in addition to the European Literary Text
Collection (ELTeC, https://www.distant-reading.net/eltec/), which was
mentioned in the initial query. Due to some difficulties in locating and
accessing the corpora listed under the first two items, the corpus
creators have agreed to deposit them in the Oxford Text Archive
collections; old and new links are given below:
1.
Corpus of Late Modern English Texts (there is a version called CLMET 3.1
as well as the extended version) and the CEN (Corpus of English Novels)
compiled by Hendrik de Smet and others:
CLMET now available here: http://hdl.handle.net/20.500.14106/2574
CEN now available here: http://hdl.handle.net/20.500.14106/2573
2.
Corpus of the Canon of Western Literature
https://www.researchgate.net/publication/321773386_Introducing_the_Corpus_of_the_Canon_of_Western_Literature_A_corpus_for_culturomics_and_stylistics
(downloadable in full from
https://www.researchgate.net/publication/385291433_CCWL_10_Jan_2018rar ).
CCWL now also available here: http://hdl.handle.net/20.500.14106/2575
3.
AMALGUM: a very deeply annotated corpus that includes about 0.5M tokens
of samples from Project Gutenberg novels, alongside data from 7 other
genres:
https://github.com/gucorpling/amalgum/blob/dev/amalgum/fiction/dep/AMALGUM_fiction_adams.conllu
The data is automatically annotated with good-quality neural UD parses,
coreference resolution, entity recognition, discourse parses and more,
with excerpts from over 400 novels included. The same group also offers
a much smaller but manually annotated corpus which includes fiction,
along with other genres, in the GUM/GENTLE corpora (24 genres total):
https://gucorpling.org/gum/
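For anyone wanting to check token counts or tag distributions in a
.conllu file such as the fiction sample linked above, a minimal Python
sketch follows. It assumes the standard 10-column, tab-separated CoNLL-U
layout; the function name read_conllu and the two-line sample sentence
are invented for illustration and are not taken from the corpus.

```python
from collections import Counter

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines):
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword-token range (2-3) or empty node (2.1)
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Invented sample in CoNLL-U layout, only to exercise the reader.
SAMPLE = """# sent_id = example-1
# text = She didn't reply.
1\tShe\tshe\tPRON\tPRP\t_\t4\tnsubj\t_\t_
2-3\tdidn't\t_\t_\t_\t_\t_\t_\t_\t_
2\tdid\tdo\tAUX\tVBD\t_\t4\taux\t_\t_
3\tn't\tnot\tPART\tRB\t_\t4\tadvmod\t_\t_
4\treply\treply\tVERB\tVB\t_\t0\troot\t_\t_
5\t.\t.\tPUNCT\t.\t_\t4\tpunct\t_\t_
"""

sentences = list(read_conllu(SAMPLE.splitlines()))
token_count = sum(len(s) for s in sentences)
upos_counts = Counter(tok["upos"] for s in sentences for tok in s)
print(token_count, dict(upos_counts))
```

To run this on a downloaded file, an open file handle can be passed in
place of the sample, e.g. read_conllu(open(path, encoding="utf-8")),
where path is wherever the file was saved locally.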
Many thanks to Bea Alex, Sabine Bartsch, Hendrik de Smet, Clarence Green
and Amir Zeldes.
I note that this hasn't revealed a great number of available corpora! I
suspect that there are a large number of ad hoc datasets that people make
for specific projects, sampling from the large number of text
collections available, but I still don't find many open access reference
corpora for particular genres and time periods.
The CLARIN Resource Family for Literary Corpora
(https://www.clarin.eu/resource-families/literary-corpora) has a number
in other languages, and I'll get this updated with the above corpora and
anything else that I find for English.
Best wishes,
Martin Wynne
On 27/10/2024 12:11, Martin Wynne via Corpora wrote:
I have a student who is interested in tracing the development of the
English novel from its origins to the present day (or at least to the
start of the twentieth century), and I'm trying to gather information
about relevant corpora covering this text type and period.
We know about the European Literary Text Collection (ELTeC,
https://www.distant-reading.net/eltec/) which will be very useful for
the later end of the timescale. We also know it is possible to
assemble a corpus from Project Gutenberg, archive.org, Oxford Text
Archive, etc., but would be interested in re-using any corpora that
people might already have made, which aim to be representative of
particular periods within this genre.
The student has some flexibility with her research question, so while
the original idea of 'English novels' was probably 'novels in English
from Great Britain and Ireland', other related areas such as US novels
might be interesting as well.
Any tips and suggestions gratefully received. If we get a number of
interesting direct emails, I'll be happy to summarize the results to
the list.
Best wishes,
Martin
--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
[email protected]
https://orcid.org/0000-0002-4155-0530