Dear Martin,

It might be useful.

The Sketch Engine database provides access to two corpora relevant to your student's interest.

- Gutenberg English Corpus 2020 <https://www.sketchengine.eu/gutenberg-corpora-2020/> - an almost 3-billion-word corpus containing all English books available through the Gutenberg platform as of April 2020. Note that the corpus metadata includes only the author’s birth and death years, not the year of publication.
- English Historical Book Collection (EEBO, ECCO, Evans) <https://www.sketchengine.eu/historical-collection-eebo-ecco-evans/> - a collection of over 800 million words from English books published in the UK and US between 1473 and 1820.

With best regards,
Michal Cukr
Sketch Engine team

On Tue, Oct 29, 2024 at 3:59 PM Martin Wynne via Corpora <[email protected]> wrote:

> Hi Amir,
>
> Many thanks for getting in touch, and for letting me know about this. I think the student in this case wants full texts, but this data looks very interesting, particularly with the rich annotation, so will definitely be useful for a number of use cases, and I'll add it to my summary for the list.
>
> Best wishes,
> Martin
>
> On 28/10/2024 20:16, [email protected] wrote:
> > Hi Martin,
> >
> > I'm not sure if this will help, but if your student is interested in doing something with richly annotated Gutenberg data, there is a very deeply annotated corpus including about 0.5M tokens of samples from Project Gutenberg novels here, next to data from 7 other genres:
> >
> > https://github.com/gucorpling/amalgum/blob/dev/amalgum/fiction/dep/AMALGUM_fiction_adams.conllu
> >
> > The data is automatically annotated with good quality neural UD parses, coreference resolution, entity recognition, discourse parses and more, with excerpts from over 400 novels included.
> > We also have a much smaller but manually annotated corpus which includes fiction, along with other genres in our GUM/GENTLE corpora (24 genres total):
> >
> > https://gucorpling.org/gum/
> >
> > Hope these are useful,
> > Amir
> > ------------
> > Dr. Amir Zeldes
> > Assoc. Prof. of Computational Linguistics
> > Department of Linguistics
> > Georgetown University
> > 1437 37th St. NW
> > Washington, DC 20057
> >
> > https://gucorpling.org/amir
> >
> > -----Original Message-----
> > From: Martin Wynne via Corpora <[email protected]>
> > Sent: Sunday, October 27, 2024 8:11 AM
> > To: [email protected]
> > Subject: [Corpora-List] Corpora of English novels
> >
> > I have a student who is interested in tracing the development of the English novel from its origins to the present day (or at least to the start of the twentieth century), and I'm trying to gather information about relevant corpora covering this text type and period.
> >
> > We know about the European Literary Text Collection (ELTeC, https://www.distant-reading.net/eltec/) which will be very useful for the later end of the timescale. We also know it is possible to assemble a corpus from Project Gutenberg, archive.org, Oxford Text Archive, etc., but would be interested in re-using any corpora that people might already have made, which aim to be representative of particular periods within this genre.
> >
> > The student has some flexibility with her research question, so while the original idea of 'English novels' was probably 'novels in English from Great Britain and Ireland', other related areas such as US novels might be interesting as well.
> >
> > Any tips and suggestions gratefully received. If we get a number of interesting direct emails, I'll be happy to summarize the results to the list.
> >
> > Best wishes,
> > Martin
> >
> > --
> > Senior Researcher in Corpus Linguistics
> > Faculty of Linguistics, Philology and Phonetics, University of Oxford
> > National Co-ordinator, CLARIN-UK
> > [email protected]
> > https://orcid.org/0000-0002-4155-0530
> >
> > _______________________________________________
> > Corpora mailing list -- [email protected]
> > https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> > To unsubscribe send an email to [email protected]
>
> --
> Senior Researcher in Corpus Linguistics
> Faculty of Linguistics, Philology and Phonetics, University of Oxford
> National Co-ordinator, CLARIN-UK
> [email protected]
> https://orcid.org/0000-0002-4155-0530
>
> _______________________________________________
> Corpora mailing list -- [email protected]
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to [email protected]
