[Corpora-List] Re: Story markup languages

Amir.Zeldes--- via Corpora Mon, 27 Jan 2025 07:25:15 -0800

Hi Darren,

In the GUM corpus <https://gucorpling.org/gum/>, which includes fiction 
chapters and short stories, we’ve also used who/whom attributes on the TEI 
<sp> element to mark speaker and addressee, like this:

...

<sp who="#Pag" whom="#Siri"> <s type="decl"> " Oh <hi rend="italic"> man </hi> 
are we in trouble . " </s> </sp> 

</p>

<sp who="#Siri" whom="#Pag">

<p> <s type="decl"> 

" They started it . " 

</s>

...
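
As a rough illustration of how dialogue could be pulled out per character from markup like the excerpt above, here is a minimal sketch using only the Python standard library. The SAMPLE string and the function name dialogue_by_speaker are hypothetical; SAMPLE is a well-formed stand-in modelled on the excerpt, and the sketch assumes un-namespaced <sp> elements (a file that declares the TEI namespace would need namespace-aware lookups).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed sample modelled on the excerpt above.
SAMPLE = '''<text>
<sp who="#Pag" whom="#Siri"> <s type="decl"> " Oh <hi rend="italic"> man </hi> are we in trouble . " </s> </sp>
<sp who="#Siri" whom="#Pag"> <p> <s type="decl"> " They started it . " </s> </p> </sp>
</text>'''


def dialogue_by_speaker(xml_string):
    """Map each speaker id (without the leading '#') to a list of utterances."""
    root = ET.fromstring(xml_string)
    result = defaultdict(list)
    for sp in root.iter("sp"):
        speaker = sp.get("who", "").lstrip("#")
        # itertext() also walks nested elements such as <hi>,
        # so inline markup is flattened into plain text.
        text = " ".join("".join(sp.itertext()).split())
        result[speaker].append(text)
    return dict(result)


if __name__ == "__main__":
    for speaker, lines in dialogue_by_speaker(SAMPLE).items():
        print(speaker, lines)
```

The whom attribute could be collected the same way if the addressee matters for the analysis.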

The data is also available with dependency parses in the CoNLL-U format, where 
speaker and addressee comment lines carry the same information:

# speaker = Siri
# addressee = Pag
# text = "They started it."
1	"	"	PUNCT	``	_	3	punct	3:punct	Discourse=evaluation-comment:152->151:0:_|SpaceAfter=No
2	They	they	PRON	PRP	Case=Nom|Number=Plur|Person=3|PronType=Prs	3	nsubj	3:nsubj	Entity=(35-person-giv:inact-cf1-1-ana)
3	started	start	VERB	VBD	Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin	0	root	0:root	MSeg=start-ed
4	it	it	PRON	PRP	Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs	3	obj	3:obj	Entity=(34-event-giv:inact-cf2-1-coref)|SpaceAfter=No
5	.	.	PUNCT	.	_	3	punct	3:punct	SpaceAfter=No
6	"	"	PUNCT	''	_	3	punct	3:punct	_
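
To give an idea of how the speaker comments can be used for per-character extraction, here is a minimal sketch that reads only the "# key = value" comment lines and groups the "# text" values by speaker. It assumes sentences are separated by blank lines, as in standard CoNLL-U. The SAMPLE data is hypothetical: the second sentence and the abbreviated token rows are placeholders, not rows copied from the corpus, and text_by_speaker is an illustrative name.

```python
from collections import defaultdict

# Hypothetical sample: the comment lines follow the pattern shown above;
# token rows are abbreviated placeholders and are ignored by the function.
SAMPLE = "\n".join([
    "# speaker = Siri",
    "# addressee = Pag",
    '# text = "They started it."',
    "1\tThey\tthey\tPRON\tPRP\t_\t0\troot\t0:root\t_",
    "",
    "# speaker = Pag",
    "# addressee = Siri",
    '# text = "Oh man are we in trouble."',
    "1\tOh\toh\tINTJ\tUH\t_\t0\troot\t0:root\t_",
])


def text_by_speaker(conllu):
    """Group sentence texts by the value of the '# speaker' comment."""
    result = defaultdict(list)
    # Blank lines separate sentences in CoNLL-U.
    for block in conllu.strip().split("\n\n"):
        meta = {}
        for line in block.splitlines():
            if line.startswith("#") and "=" in line:
                key, value = line[1:].split("=", 1)
                meta[key.strip()] = value.strip()
        # Sentences without a speaker comment (narration) are skipped.
        if "speaker" in meta and "text" in meta:
            result[meta["speaker"]].append(meta["text"])
    return dict(result)


if __name__ == "__main__":
    for speaker, sentences in text_by_speaker(SAMPLE).items():
        print(speaker, sentences)
```

From there, per-character vocabulary counts are a matter of tokenizing each list, or of reading the FORM/LEMMA columns of the same sentences instead of the text comment.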

For speech that can be attributed to a speaker but without actual speech 
happening (e.g. “According to Bob Dylan, behind every beautiful thing, there's 
some kind of pain”), we also have explicit attribution relation annotations in 
the framework of Enhanced Rhetorical Structure Theory (eRST 
<https://gucorpling.org/erst/> ), and there are similar annotations for 
attributions in the framework of the Penn Discourse Treebank as well.

Hope that’s helpful!

Amir

------------

Dr. Amir Zeldes

Assoc. Prof. of Computational Linguistics

Department of Linguistics

Georgetown University

1437 37th St. NW

Washington, DC 20057

https://gucorpling.org/amir 

From: James Tauber via Corpora <[email protected]> 
Sent: Saturday, January 25, 2025 12:29 AM
To: Darren Cook <[email protected]>
Cc: [email protected]
Subject: [Corpora-List] Re: Story markup languages

TEI has a said element type with a who attribute that can be used to encode 
this information.

Alternatively you can use standoff annotation (which is particularly helpful if 
you are doing a lot of other annotations on the same base text).
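
To illustrate the general idea of standoff annotation (not the format of any particular project), here is a minimal sketch in which the base text is left untouched and speaker attributions are stored separately as character offsets. BASE_TEXT, ANNOTATIONS, and quotes_by_speaker are all hypothetical names and data chosen for the example.

```python
from collections import defaultdict

# The base text stays free of markup.
BASE_TEXT = '"Oh man are we in trouble." "They started it."'

# Hypothetical standoff layer: (start, end, speaker) character offsets
# into BASE_TEXT, with end exclusive.
ANNOTATIONS = [
    (0, 27, "Pag"),
    (28, 46, "Siri"),
]


def quotes_by_speaker(text, annotations):
    """Resolve offset annotations against the base text, grouped by speaker."""
    result = defaultdict(list)
    for start, end, speaker in annotations:
        result[speaker].append(text[start:end])
    return dict(result)


if __name__ == "__main__":
    print(quotes_by_speaker(BASE_TEXT, ANNOTATIONS))
```

Because each layer only points into the text, further layers (for example entity or discourse annotations) can be added alongside without touching either the text or the speaker layer.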

We've done the latter at the Digital Tolkien Project and I've used it to 
contrast the style of different characters (as well as change throughout a 
novel).

James

On Fri, Jan 24, 2025 at 7:46 AM Darren Cook via Corpora <[email protected] 
<mailto:[email protected]> > wrote:

Is there any established xml or other markup language for novels and 
short stories?

I'm particularly interested in marking up dialogue with the name of the 
character who is speaking, and then in tools that allow extracting the 
dialogue of each character (e.g. to analyse and contrast the vocabulary 
each uses).

If so, following on from that, are there open-source ML models that try 
to identify the speaker to add this markup, and existing training data?

Thanks,
Darren

_______________________________________________
Corpora mailing list -- [email protected] <mailto:[email protected]> 
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/ 
To unsubscribe send an email to [email protected] 
<mailto:[email protected]>
