[Corpora-List] GUM Corpus V10 - new genres and annotations

Amir Zeldes via Corpora Mon, 19 Feb 2024 08:07:21 -0800

(Apologies for cross-postings)

*** The GUM Corpus - Release 10.0.0 ***

*** Georgetown University Multilayer corpus ***

Corpling@GU <https://gucorpling.org/corpling/> is happy to announce the first
release of series 10 of the Georgetown University Multilayer corpus (GUM
V10.0.0):

https://gucorpling.org/gum/

New in this version: 

- 4 new genres with 22 new documents (total tokens in the corpus: 228,399):

  - Courtroom transcripts

  - Essays

  - Letters (on paper, not e-mails)

  - Podcasts

- Expansions to the discourse annotation layer

  - Enhanced RST parses with additional, non-projective tree-breaking relations 
(multiple relations per node)

  - Complete signaling annotation including discourse markers and other 
discourse signals following the Signaling Corpus

  - PDTB-style connective annotation and DISRPT-style relation classification
data

- Morphological segmentation following UniMorph

- Annotation of select constructions based on Construction Grammar (e.g. 
resultatives, NPN, causal-excess)

- Many corrections to all annotation layers

GUM is an open source corpus of richly annotated English texts from 16 genres: 
academic, bio, courtroom, conversation, essay, fiction, interview, letters, 
news, podcasts, speeches, textbooks, travel, vlogs, how-to and Reddit forum 
discussions. The corpus is created by students as part of the Computational 
Linguistics curriculum at Georgetown University and is available under Creative 
Commons licenses.

This is the first version of GUM series 10, containing roughly 228K tokens 
annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and 
UPOS) and UD morphological features

- Manually corrected lemmatization and morphological segmentation

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions 
etc., all manual)

- Constituent and dependency syntax (manually corrected Universal Dependencies, 
and PTB parses from gold tags with function labels and enhanced dependencies)

- Information status (given-active/inactive, accessible-inferable/common 
ground/aggregate, and new)

- Entity type, salience and coreference annotation (including non-named 
entities, singletons, appositions, cataphora and several types of bridging), as 
well as Centering Theory annotations

- Entity linking (Wikification) of all named entities with Wikipedia articles, 
including their non-named and pronominal mentions

- Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse 
dependencies

- Discourse signal annotations classified into 9 major and 45 minor types 
indicating how the presence of a relation is marked (based on the Signaling 
Corpus scheme)

- Abstractive summaries for each document (two summaries per document in the 
test set)
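
As a minimal sketch of how the token-level layers listed above (lemmas, UPOS
and PTB tags, morphological features, dependencies) might be read: the code
below assumes the dependency files follow the standard ten-column CoNLL-U
format used by Universal Dependencies. That is an assumption here, so check
the corpus website for the actual file formats. The sample sentence and the
read_conllu helper are invented for illustration and are not taken from GUM.

```python
# Illustrative sketch only: SAMPLE is an invented sentence, not GUM data.
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t2:nsubj\t_\n"
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind|Tense=Pres\t0\troot\t0:root\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t2:punct\t_\n"
)

# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments/metadata are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue
        if not cols[0].isdigit():
            # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


sentences = list(read_conllu(SAMPLE))
for tok in sentences[0]:
    print(tok["form"], tok["lemma"], tok["upos"], tok["xpos"],
          tok["head"], tok["deprel"])
```

Other layers (e.g. discourse parses or entity annotations) may be distributed
in different formats and would need their own readers.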

Note on Reddit data: token text is not contained in the release but can be 
downloaded with an included script.

For more information and to search or download the corpus online, see the
corpus website <https://gucorpling.org/gum/>.

Best wishes,

The GUM team

PS – if you like GUM, check out our ‘extreme genre test set’ GENTLE
<https://github.com/gucorpling/gentle/>, and the larger, automatically
annotated AMALGUM <https://github.com/gucorpling/amalgum/> corpus!
