>> I am looking for freely available English corpora that include lemmas of the
>> words. Corpora would be used as a gold standard, so lemmas should be
>> hand-annotated or at least human verified.
>> So far I have only found the British National Corpus: http://www.natcorp.ox.ac.uk/
All of the BYU corpora (http://corpus.byu.edu) are based directly on the
original PoS tagging and lemmatization in the BNC, including the 520 million
word COCA corpus, the 1.9 billion word GloWbE Corpus, and the new NOW corpus (3.3
billion words, and growing by 4-5 million words a day).
In addition to the free web-based interface, the corpus data is also available
in downloadable full-text format: http://corpus.byu.edu/full-text/, including
free samples (~2 million words each from COCA, COHA, and GloWbE).
The lemmatization was subsequently corrected for the word frequency data that
is based on these corpora (http://www.wordfrequency.info/), which includes the
top 60,000 lemmas in COCA, and the top 100,000 word forms (+PoS and lemmas) in
COCA, COHA, BNC, and SOAP. In both cases, the word frequency / lemma lists were
Professor of Linguistics / Brigham Young University
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
From: corpora-boun...@uib.no <corpora-boun...@uib.no> on behalf of matej
Sent: Monday, September 19, 2016 3:32 AM
Subject: [Corpora-List] English corpora with lemmas
Any suggestions about other available corpora would be helpful. Thanks!