TokShop: Tokenization Workshop (ICML 2025)

Submissions to the Tokenization Workshop open on April 14, 2025, via OpenReview. The submission deadline is May 30, 2025, at 11:59pm (anywhere on earth). Acceptance notifications will be sent on June 9, 2025, and camera-ready papers will be due shortly afterward, also at 11:59pm (anywhere on earth). The workshop will take place on July 18, 2025.
Workshop Description

The Tokenization Workshop (TokShop) at ICML aims to bring together researchers and practitioners from all corners of machine learning to explore tokenization in its broadest sense. We will discuss innovations, challenges, and future directions for tokenization across diverse data types and modalities.
Call for Papers

Topics of interest include:

- Subword Tokenization in NLP: Analysis of techniques such as BPE, WordPiece, and UnigramLM, as well as improvements for efficiency, interpretability, and adaptability.
- Multimodal Tokenization: Tokenization strategies for images, audio, video, and other modalities, including methods to align representations across different types of data.
- Multilingual Tokenization: Development of tokenizers that work robustly across languages and scripts, and investigation of failure modes tied to tokenization.
- Tokenizer Modification Post-Training: Methods for updating tokenizers after model training to boost performance and/or efficiency without retraining from scratch.
- Alternative Input Representations: Exploration of non-traditional tokenization approaches, such as byte-level, pixel-level, or patch-based representations.
- Statistical Perspectives on Tokenization: Empirical analysis of token distributions, compression properties, and correlations with model behavior.

By broadening the scope of tokenization research beyond language, this workshop seeks to foster cross-disciplinary dialogue and inspire new advances at the intersection of representation learning, data efficiency, and model design.
Submission Guidelines

Our author guidelines follow the ICML requirements unless otherwise specified.

- Paper submission is hosted on OpenReview.
- Each submission may contain up to 9 pages, not including references or appendix (shorter submissions are also welcome).
- Please use the provided LaTeX template (Style Files) for your submission, and follow the general ICML paper formatting guidelines as specified in the style files. Authors may not modify the style files or use templates designed for other conferences.
- The paper should be anonymized and uploaded to OpenReview as a single PDF.
- You may use as many pages of references and appendix as you wish, but reviewers are not required to read the appendix.
- Posting papers on preprint servers such as arXiv is permitted.
- We encourage each submission to discuss the limitations as well as the ethical and societal implications of the work, wherever applicable (neither is required). These sections do not count towards the page limit.
- This workshop offers both archival and non-archival options for submissions. Archival papers will be indexed with proceedings, while non-archival submissions will not.
- The review process will be double-blind.

Read more: https://tokenization-workshop.github.io/
 
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
