(Apologies for cross-postings)

*** The GUM Corpus - Release 10.0.0 ***
*** Georgetown University Multilayer corpus ***

Corpling@GU <https://gucorpling.org/corpling/> is happy to announce the first release of series 10 of the Georgetown University Multilayer corpus (GUM V10.0.0):

https://gucorpling.org/gum/

New in this version:

- 4 new genres with 22 new documents (total tokens: 228,399):
  - Courtroom transcripts
  - Essays
  - Letters (on paper, not e-mails)
  - Podcasts
- Expansions to the discourse annotation layer:
  - Enhanced RST parses with additional, non-projective tree-breaking relations (multiple relations per node)
  - Complete signaling annotation including discourse markers and other discourse signals following the Signaling Corpus
  - PDTB-style connective annotation and DISRPT-style relation classification data
- Morphological segmentation following UniMorph
- Annotation of select constructions based on Construction Grammar (e.g. resultatives, NPN, causal-excess)
- Many corrections to all annotation layers

GUM is an open source corpus of richly annotated English texts from 16 genres: academic, bio, courtroom, conversation, essay, fiction, interview, letters, news, podcasts, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 10, containing roughly 228K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
- Manually corrected lemmatization and morphological segmentation
- Sentence segmentation and rough speech act (manual)
- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)
- Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)
- Entity type, salience and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging), as well as Centering Theory annotations
- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
- Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies
- Discourse signal annotations classified into 9 major and 45 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)
- Abstractive summaries for each document (two summaries per document in the test set)

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website <https://gucorpling.org/gum/>.

Best wishes,
The GUM team

PS – if you like GUM, check out our ‘extreme genre test set’ GENTLE <https://github.com/gucorpling/gentle/>, and the larger, automatically annotated AMALGUM <https://github.com/gucorpling/amalgum/> corpus!
