(Apologies for cross-postings)

  

*** The GUM Corpus - Release 9.0.0 ***

*** Georgetown University Multilayer corpus ***

  

Corpling@GU <https://gucorpling.org/corpling/> is happy to announce the first
release of series 9 of the Georgetown University Multilayer corpus (GUM V9.0.0):

  

https://gucorpling.org/gum/

  

New in this version: 

  

- 20 new documents added, including more conversational data (total tokens:
203,879)

- Abstractive summaries for each document

- Annotations for salient/non-salient entities in each document

- Foreign language tags to identify individual source languages where relevant

- New, easier process for reconstructing Reddit text data

- Many corrections to all annotation layers

  

GUM is an open-source corpus of richly annotated English texts from multiple
genres: academic, bio, conversation, fiction, interview, news, speeches,
textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is
created by students as part of the Computational Linguistics curriculum at
Georgetown University and is available under Creative Commons licenses.

  

This is the first version of GUM series 9, containing roughly 200K tokens 
annotated for:

  

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and 
UPOS) and UD morphological features

- Manually corrected lemmatization

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions 
etc., all manual)

- Constituent and dependency syntax (manually corrected Universal Dependencies, 
and PTB parses from gold tags with function labels)

- Information status (given-active/inactive, accessible-inferable/common 
ground/aggregate, and new)

- Entity type, salience and coreference annotation (including non-named 
entities, singletons, appositions, cataphora and several types of bridging)

- Entity linking (Wikification) of all named entities with Wikipedia articles, 
including their non-named and pronominal mentions

- Discourse parses in Rhetorical Structure Theory and discourse dependencies

- Abstractive summaries

  

Note on Reddit data: token text is not contained in the release but can be 
downloaded with an included script.

  

For more information and to search or download the corpus online, see the
corpus website <https://gucorpling.org/gum/>.

  

Best wishes,

The GUM team

  

  

  
