In this newsletter:
New publications:
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese 
Audio<https://catalog.ldc.upenn.edu/LDC2025S04>
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and 
Translations<https://catalog.ldc.upenn.edu/LDC2025T05>

________________________________

New publications:
BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese 
Audio<https://catalog.ldc.upenn.edu/LDC2025S04> was developed by LDC and 
consists of 93 hours of speech from 236 unscripted telephone conversations 
between native speakers of the Mandarin Chinese dialect spoken in mainland 
China. The calls were collected by LDC in the CALLFRIEND and CALLHOME series, 
in which participants called family members or close friends and spoke on topics 
of their choice. Around 60% of the recordings (141 calls) are publicly released 
for the first time. The remaining 95 recordings were previously published by 
LDC in various CALLFRIEND, CALLHOME, and HUB5 Mandarin datasets. The data is 
divided into training, development, and evaluation partitions.

The DARPA BOLT (Broad Operational Language Translation) program developed 
machine translation and information retrieval for less formal genres, focusing 
particularly on user-generated content. LDC supported the BOLT program by 
collecting informal data sources -- discussion forums, conversational telephone 
speech, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. 
The material in this release represents the unannotated Chinese source 
conversational telephone speech. The telephone data was transcribed, 
translated, and annotated for various tasks in the BOLT program including word 
alignment, treebanking, and co-reference.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and 
Translations<https://catalog.ldc.upenn.edu/LDC2025T05> contains transcripts and 
corresponding English translations for the conversational telephone speech in 
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese 
Audio<https://catalog.ldc.upenn.edu/LDC2025S04> and was developed by LDC to 
support the DARPA BOLT program.

Transcribers were required to produce a verbatim transcript of all speech 
within a file using simplified Chinese orthography and to add minimal markup to 
capture salient features of the speech. Some transcripts include redactions for 
potential personally identifying information. All speech data was transcribed 
and is divided into training, development, and evaluation partitions.

The goal of the BOLT translation task was to translate the Chinese transcripts 
into fluent English while preserving the meaning present in the original 
Chinese text. Transcripts in the development and evaluation partitions received 
first-pass and gold-standard translations. Overall, 89% of the transcripts were 
translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104





_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
