CALL FOR PARTICIPATION AT IberLEF 2025

PastReader 2025
IberLEF Task on Transcription of Historical Content
First edition - Transcribing texts from the past

Shared task website:  https://sites.google.com/view/pastreader2025/home


Held as part of the evaluation forum IberLEF 2025
(https://sites.google.com/view/iberlef-2025) at the XLI edition of the
International Conference of the Spanish Society for Natural Language
Processing (SEPLN 2025, https://eventos.ita.es/sepln_2025/inicio/)

September 23, 2025. Zaragoza, Spain


Dear All,

We are pleased to announce that registration is now open for the shared
task 'PastReader 2025: IberLEF Task on Transcription of Historical Content
(First Edition) - Transcribing Texts from the Past'.

The PastReader task is held as part of IberLEF 2025, the shared evaluation
campaign for Natural Language Processing systems in Spanish and other
Iberian languages, co-located with the SEPLN 2025 conference.

This is a novel task focusing on the correction of text extracted from
digitized historical documents. Participants must generate clean,
corrected versions of texts extracted via OCR from the Spanish historical
press. The corrected text should be faithful to the original while fixing
common errors introduced by the digitization and OCR process. For this
edition, the collection is based on the Hemeroteca Digital of the National
Library of Spain (BNE).

   - A dataset of digitized historical press from the BNE will be used.
   - The collection contains millions of digitized pages of Spanish
     newspapers and magazines.
   - The texts are in PDF format with OCR.
   - The corpus includes publications from the 17th to the 20th century.
   - The publications cover a wide variety of topics: politics, satire,
     humor, science, religion, illustration, entertainment, sports, art,
     and literature.
   - The goal is to advance the automation of the transcription process.


TASK

Two tasks have been defined, mirroring the basic workflow of a
transcription process: extraction of text from scanned documents (OCR) and
curation of the extracted text to fix any errors found:

   - Task 1: Error correction. Participants will be provided with the
     output of an OCR system and asked to generate clean, corrected
     versions of the extracted texts.
   - Task 2: End-to-end extraction. Given recent advances in multimodal
     systems, this task explores end-to-end approaches that take scanned
     pages as input and produce curated texts as output.
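
To make the input and output of Task 1 concrete, here is a minimal Python
sketch of a rule-based cleanup step. It is an illustration only, not an
official baseline: the function name baseline_clean and the two rules it
applies (rejoining words split by a hyphen at a line break, and
normalizing spaces) are assumptions chosen for the example.

```python
import re


def baseline_clean(ocr_text: str) -> str:
    """Hypothetical rule-based cleanup of OCR output (not an official baseline)."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", ocr_text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing spaces on every line.
    return "\n".join(line.strip() for line in text.split("\n"))


if __name__ == "__main__":
    print(baseline_clean("El perió-\ndico  de   Madrid "))
```

Note that blind dehyphenation can also merge genuine compounds that happen
to break at a line end, so a real system would likely need a lexicon or a
language model to stay faithful to the original.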


DATA

For this shared task, three subsets of data have been prepared:

   - Training set: 8,959 pages (scanned PDF, OCR output, and corrected
     text).
   - Development set: 500 pages (scanned PDF, OCR output, and corrected
     text).
   - Test set:
       Task 1: 2,736 pages (only the OCR output is released to
       participants).
       Task 2: 2,736 pages (only the scanned PDF is released to
       participants).

The quality of the OCR results varies due to several factors, such as the
date of digitization, available technology, the state of preservation of
the originals, and the complexity of the text structure. Efforts have been
made to improve these texts, including collaborative corrections through
the ComunidadBNE platform. The manually corrected output serves as a
valuable resource for testing and training technology.
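
As a rough way to gauge how far an OCR page is from its manually corrected
version, one could compute a character error rate (CER) over the training
or development pairs. The sketch below is an assumption made for
illustration: this announcement does not state the official evaluation
metric, and the function names levenshtein and cer are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else float("inf")
    return levenshtein(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    # One wrong character out of six gives a CER of about 0.167.
    print(cer("Madrld", "Madrid"))
```

Comparing the CER of the raw OCR output with the CER of a system's
corrected output, both measured against the manual correction, gives a
simple indication of whether the system helped.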


Participating in this task is a great opportunity to advance historical
text processing. You'll work with a large dataset from the National Library
of Spain (BNE), improving OCR correction skills and contributing to
research. Your contribution will aid in digitizing historical documents for
future access.

To participate, go to: https://forms.gle/iBwuUzjZdc2JyFDKA


IMPORTANT DATES


Feb 3rd: Registration open

Mar 17th: Release of training corpora

Mar 31st: Registration closed

Apr 7th: Release of test corpora and start of the evaluation campaign

Apr 14th: End of evaluation campaign (deadline for submission of runs)

Apr 18th: Publication of official results and release of test gold labels

May 12th: Deadline for paper submission

May 30th: Acceptance notification

Jun 16th: Camera-ready submission deadline

Jul 3rd: Final camera-ready submission deadline (to IberLEF organizers)

Sep, TBD: Publication of proceedings

Sep, TBD: IberLEF Workshop at SEPLN 2025

ORGANIZING COMMITTEE

- Arturo Montejo Ráez (Universidad de Jaén).

- Elena Sánchez Nogales (Biblioteca Nacional de España).

- Gloria Expósito Álvarez (Biblioteca Nacional de España).

- L. Alfonso Ureña López (Universidad de Jaén).

- María Teresa Martín Valdivia (Universidad de Jaén).

- Jaime Collado Montañez (Universidad de Jaén).

- Isabel Cabrera De Castro (Universidad de Jaén).

- María Victoria Cantero Romero (Universidad de Jaén).

- Ana García Serrano (UNED).

- Rocio Ortuño Casanova (UNED).

- Yanco Amor Torterolo Orta (UNED).




Best regards,



The PastReader 2025 organizing committee




Arturo Montejo Ráez
Profesor Titular de Universidad | Associate Professor (Tenured)
[email protected]

Universidad de Jaén
Departamento de Informática, A3-114
Las Lagunillas s/n, 23071 - Jaén (Spain)
+34 953 212 882
ORCID:  http://orcid.org/0000-0002-8643-2714
Researcher ID: D-3387-2009
SINAI Research Group <https://sinai.ujaen.es>
