12th Workshop on the Challenges in the Management of Large Corpora
2nd Call for Papers
The next meeting of CMLC will be held as part of the LREC-2026 conference
<https://lrec2026.info/> in Palma, Mallorca.
Workshop description
As in the previous CMLC meetings, we wish to explore common areas of
interest across a range of issues in language resource management,
corpus linguistics, natural language processing, natural language
generation, and data science.
Large textual datasets require careful design, collection, cleaning,
encoding, annotation, storage, retrieval, and curation to be of use for
a wide range of research questions and to users across a number of
disciplines. A growing number of national and other very large corpora
are being made available, many historical archives are being digitised,
numerous publishing houses are opening their textual assets for text
mining, and many billions of words can be quickly sourced from the web
and online social media.
A mixed blessing of the times is that many such texts, in mono- and
multilingual arrangements, can now be created automatically by
exploiting Large Language Models at various scales. That, on the one
hand, makes it possible to inflate the amounts of data where normally
data would be scarce: in under-resourced languages or language
varieties, in specific genres or for intricate and rarely attested
constructions. On the other hand, such procedures immediately raise
concerns regarding the authenticity and quality of such data, casting
doubt on the possibility of adequately (truthfully, verifiably,
reproducibly) addressing the kind of research questions that provoked
the rapid but tainted increase of the available data volumes in the
first place. Similar doubts may be directed at mass creation of
secondary and tertiary data ordinarily crucial for linguistic research:
apart from potential legal constraints on the use of the initial amounts
of human-created data, new questions arise as to the legal status of the
derived data, the ways to create e.g. provenance metadata of the derived
resources, and the level of trust regarding mass-produced grammatical
(and other) annotation layers.
These new questions, as well as more traditional ones, underlie the
list of topics that the management of large corpora (for any currently
suitable definition of “large”) raises or at least touches upon.
Topics of interest
This year's event adds new items to the standard range of CMLC themes
and addresses some of the LREC-2026 focus topics:
* Interoperability and accessibility
o How to make corpora as accessible as possible
o Interoperable APIs for query and analysis software
o Provision of multiple levels of access for different tasks
* Machine/Deep Learning
o Data preparation for machine learning input
o Creation, curation, maintenance and dissemination of language
models based on machine learning (e.g. word embeddings and
entire deep learning networks)
o Legal issues concerning language model distribution
* Linguistic content challenges
o Dealing with the variety of language: multilinguality, minority
and/or underrepresented languages, historical texts, noisy OCR
texts, user-generated content, etc.
o Diversity and inclusion in language resources
o Integration of human computation (crowdsourcing) and automatic
annotation
o Quality management of annotations
o Ensuring linguistic integrity of data through deduplication,
correction of typos and errors, removal of incomplete or
malformed sentences, and filtering harmful, offensive and toxic
content, etc.
o Integrating different linguistic data types (text, audio, video,
facsimiles, experimental data, neuroimaging data, …)
* Technical challenges
o Storage and retrieval solutions for large text corpora: primary
data (potentially including facsimiles, etc.), metadata, and
annotation data
o Corpus versioning and release management
o Scalable and efficient NLP tooling for annotating and analysing
large datasets: distributed and GPGPU computing; using big data
analysis frameworks for language processing
o Dealing with streaming data (e.g. social media) and rapidly
changing corpora
o Environmental impact of big language data computing
o Engineering and management of research software
* Exploitation challenges
o Legal and privacy issues
o Query languages, data models, and standardisation
o Licensing models of open and closed data, coping with
intellectual property restrictions
o Innovative approaches for aggregation and visualisation of text
analytics
o Repurposing or extending application areas of existing corpora
and tools
In the tradition of CMLC, we invite reports on national corpus
initiatives; submitters of these reports should be prepared to present a
poster.
Important dates
* Deadline for paper submission: the 16th of February 2026 (Monday,
23:59 UTC)
* Notification of acceptance: the 12th of March 2026 (Thursday)
* Deadline for the submission of camera-ready papers: the 30th of
March 2026 (Monday)
* Meeting: the 11th of May, morning slot
Paper submission
* We invite anonymised extended abstracts for oral presentations on
the topics listed above, as PDF files created according to the LREC-2026
templates <https://lrec2026.info/authors-kit/>.
o Length and content: 4 to 8 pages, excluding
acknowledgements, references, potential Ethics Statements and
discussion on Limitations. Appendices or supplementary material
are not permitted during the initial submission phase, as papers
should be self-contained and reviewable on their own. However,
appendices and supplementary material will be allowed in the
final, camera-ready version of the paper.
* CMLC has always reserved a track for national corpus project
reports, and to this end, we invite poster proposals of 500-750
words. National project reports need not be anonymised.
* Submissions are accepted solely through the LREC START system
<https://softconf.com/lrec2026/CMLC2026/>.
* A volume of proceedings will be published online by ELRA. Oral and
poster contributions will have equal status.
LRE 2026 Map and the "Share your LRs!" initiative
When submitting a paper from the START page, authors will be asked to
provide essential information about resources (in a broad sense, i.e.
also technologies, standards, evaluation kits, etc.) that have been used
for the work described in the paper or are a new result of your
research. Moreover, ELRA encourages all LREC authors to share the
described LRs (data, tools, services, etc.) to enable their reuse and
replicability of experiments (including evaluation ones).
Programme Committee
* Laurence Anthony (Waseda University, Japan)
* Vladimír Benko (Slovak Academy of Sciences)
* Mark Davies (English-Corpora.org)
* Nils Diewald (IDS Mannheim)
* Kaja Dobrovoljc (University of Ljubljana / Jožef Stefan Institute)
* Jarle Ebeling (University of Oslo)
* Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
* Andrew Hardie (Lancaster University, UK)
* Serge Heiden (ENS de Lyon)
* Ulrich Heid (University of Hildesheim)
* Nancy Ide (Vassar College / Brandeis University)
* Olha Kanishcheva (Heidelberg University)
* Gražina Korvel (Vilnius University)
* Natalia Kocyba (Samsung Poland)
* Michal Křen (Charles University, Prague)
* Anna Latusek (ICS PAS, Warsaw)
* Paul Rayson (Lancaster University)
* Laurent Romary (INRIA)
* Thomas Schmidt (University of Duisburg-Essen)
* Serge Sharoff (University of Leeds)
* Maria Shvedova (Kharkiv Polytechnic Institute / University of Jena)
* Irena Spasić (Cardiff University)
* Martin Wynne (University of Oxford)
Organising Committee
* Piotr Bański (IDS Mannheim)
* Dawn Knight (Cardiff University)
* Marc Kupietz (IDS Mannheim)
* Andreas Witt (IDS Mannheim)
* Alina Wróblewska (ICS PAS, Warsaw)
Homepage
The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html.