[Corpora-List] 3rd CfP and deadline extension: CMLC-12: 12th Workshop on the Challenges in the Management of Large Corpora

Marc Kupietz via Corpora Fri, 13 Feb 2026 01:03:21 -0800

12th Workshop on the Challenges in the Management of Large Corpora

The next meeting of CMLC (see also http://corpora.ids-mannheim.de/cmlc.html) will be
held as part of the LREC-2026 conference [3] in Palma, Mallorca.

3rd Call for Papers and deadline extension

Important dates

 * Deadline for paper submission: the 25th of February 2026 (Wednesday, 23:59 UTC),
   extended from the 16th
 * Notification of acceptance: the 12th of March 2026 (Thursday)
 * Deadline for the submission of camera-ready papers: the 30th of March 2026 
(Monday)
 * Meeting: the 11th of May, morning slot

Paper submission

 * We invite anonymised extended abstracts for oral presentations on the topics
   listed below, as PDF files created according to the LREC-2026 templates [1].
   Length and content: 4 to 8 pages, excluding acknowledgements, references,
   potential Ethics Statements and discussion of Limitations. Appendices or
   supplementary material are not permitted during the initial submission
   phase, as papers should be self-contained and reviewable on their own.
   However, appendices and supplementary material will be allowed in the
   final, camera-ready version of the paper.
 * CMLC has always reserved a track for national corpus project reports, and to
   this end, we invite poster proposals of 500-750 words. National project
   reports need not be anonymised.
 * Submissions are accepted solely through the LREC START system [2].
 * A volume of proceedings will be published online by ELRA. Oral and poster
   contributions will have equal status.

Workshop description

As in the previous CMLC meetings, we wish to explore
common areas of interest across a range of issues in language resource 
management, corpus linguistics, natural language processing, natural
language generation, and data science.
 Large textual datasets require careful design, collection, cleaning, encoding, 
annotation, storage, retrieval, and curation to be of use for a wide range of 
research questions and to users across a
number of disciplines. A growing number of national and other very large 
corpora are being made available, many historical archives are being digitised, 
numerous publishing houses are opening their
textual assets for text mining, and many billions of words can be quickly 
sourced from the web and online social media.
 A mixed blessing of the times is that many such texts, in mono- and
multilingual arrangements, can now be created automatically by exploiting Large
Language Models at various scales. That, on the
one hand, makes it possible to inflate the amounts of data where normally data 
would be scarce: in under-resourced languages or language varieties, in 
specific genres or for intricate and rarely
attested constructions. On the other hand, such procedures immediately raise 
concerns regarding the authenticity and quality of such data, casting doubt on 
the possibility of adequately (truthfully,
verifiably, reproducibly) addressing the kind of research questions that 
provoked the rapid but tainted increase of the available data volumes in the 
first place. Similar doubts may be directed at
mass creation of secondary and tertiary data ordinarily crucial for linguistic 
research: apart from potential legal constraints on the use of the initial 
amounts of human-created data, new questions
arise as to the legal status of the derived data, the ways to create e.g. 
provenance metadata of the derived resources, and the level of trust regarding 
mass-produced grammatical (and other)
annotation layers.
 These new as well as more traditional questions lie at the base of the list of 
topics that management of large corpora (for any currently suitable definition 
of “large”) invokes or at least strongly
brushes against.

Topics of interest

This year's event adds new items to the standard range of CMLC themes and
addresses some of LREC-2026 focus topics:

 * Interoperability and accessibility
    * How to make corpora as accessible as possible
    * Interoperable APIs for query and analysis software
    * Provision of multiple levels of access for different tasks
 * Machine/Deep Learning
    * Data preparation for machine learning input
    * Creation, curation, maintenance and dissemination of language models
      based on machine learning (e.g. word embeddings and entire deep
      learning networks)
    * Legal issues concerning language model distribution
 * Linguistic content challenges
    * Dealing with the variety of language: multilinguality, minority and/or
      underrepresented languages, historical texts, noisy OCR texts,
      user-generated content, etc.
    * Diversity and inclusion in language resources
    * Integration of human computation (crowdsourcing) and automatic
      annotation
    * Quality management of annotations
    * Ensuring linguistic integrity of data through deduplication, correction
      of typos and errors, removal of incomplete or malformed sentences, and
      filtering harmful, offensive and toxic content, etc.
    * Integrating different linguistic data types (text, audio, video,
      facsimiles, experimental data, neuroimaging data, …)
 * Technical challenges
    * Storage and retrieval solutions for large text corpora: primary data
      (potentially including facsimiles, etc.), metadata, and annotation data
    * Corpus versioning and release management
    * Scalable and efficient NLP tooling for annotating and analysing large
      datasets: distributed and GPGPU computing; using big data analysis
      frameworks for language processing
    * Dealing with streaming data (e.g. Social Media) and rapidly changing
      corpora
    * Environmental impact of big language data computing
    * Engineering and management of research software
 * Exploitation challenges
    * Legal and privacy issues
    * Query languages, data models, and standardisation
    * Licensing models of open and closed data, coping with intellectual
      property restrictions
    * Innovative approaches for aggregation and visualisation of text
      analytics
    * Repurposing or extending application areas of existing corpora and tools

National corpus initiatives

In the tradition of CMLC, we invite reports on national corpus initiatives;
submitters of these reports should be prepared to present a poster. Given that
it's been a while since the last round, we would be happy to have a little
"What's the news?" session, and we cordially invite both our veteran presenters
and colleagues who have not yet introduced their national corpus projects.
 Our poster sessions are usually scheduled to overlap with the coffee break, to
ensure an informal atmosphere and to make the most of the time slot available
to us. A flash presentation section is planned for just before the poster
session: ca. 3 minutes for the highlights.

LRE 2026 Map and the "Share your LRs!" initiative

When submitting a paper from the START page, authors will be asked to provide
essential information about resources (in a broad sense, i.e. also
technologies, standards, evaluation kits, etc.) that have been used for the
work described in the paper or are a new result of your research. Moreover,
ELRA encourages all LREC authors to share the described LRs (data, tools,
services, etc.) to enable their reuse and replicability of experiments
(including evaluation ones).

Programme Committee

 * Laurence Anthony (Waseda University, Japan)
 * Vladimír Benko (Slovak Academy of Sciences)
 * Felix Bildhauer (IDS Mannheim)
 * Mark Davies (English-Corpora.org)
 * Nils Diewald (IDS Mannheim)
 * Kaja Dobrovoljc (University of Ljubljana / Jožef Stefan Institute)
 * Jarle Ebeling (University of Oslo)
 * Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
 * Andrew Hardie (Lancaster University, UK)
 * Serge Heiden (ENS de Lyon)
 * Ulrich Heid (University of Hildesheim)
 * Nancy Ide (Vassar College / Brandeis University)
 * Olha Kanishcheva (Heidelberg University)
 * Gražina Korvel (Vilnius University)
 * Natalia Kocyba (Samsung Poland)
 * Michal Křen (Charles University, Prague)
 * Anna Latusek (ICS PAS, Warsaw)
 * Paul Rayson (Lancaster University)
 * Laurent Romary (INRIA)
 * Thomas Schmidt (University of Duisburg-Essen)
 * Serge Sharoff (University of Leeds)
 * Maria Shvedova (Kharkiv Polytechnic Institute / University of Jena)
 * Irena Spasić (Cardiff University)
 * Martin Wynne (University of Oxford)

Organising Committee

 * Piotr Bański (IDS Mannheim)
 * Dawn Knight (Cardiff University)
 * Marc Kupietz (IDS Mannheim)
 * Andreas Witt (IDS Mannheim)
 * Alina Wróblewska (ICS PAS, Warsaw)


[1] LREC-2026 templates https://lrec2026.info/authors-kit/
[2] LREC START system https://softconf.com/lrec2026/CMLC2026/
[3] LREC-2026 conference https://lrec2026.info/
