[Corpora-List] Applied Sciences: Special Issue on Tokenization and related phenomena

Yuval Pinter via Corpora Wed, 28 Aug 2024 03:07:46 -0700

Dear Colleagues,

As language models take center stage not only in NLP but in a vast array of
scientific applications, the question of how best to map natural language in
textual form into vector space is attracting growing interest.
While most popular models still use subword tokens as their atomic units,
“token-free” methods including character-level, byte-level, and encoding of
visual text rendering have been making promising progress. Still,
development and analysis of tokenization and untokenization methods are
advancing at a slower rate than research in model architecture and
optimization technologies, mostly due to the early stage at which
representation is applied, which makes evaluation of new algorithms and
techniques particularly challenging. Fundamental insights into the effect
of representation atomicity on morphological modeling, on multilingual and
crosslingual applications, on computation efficiency, on representations of
groups in society, and on other aspects, are still being gained, making
this research topic ripe for aggregation and integration of findings and
methodologies.


Our special issue, entitled *Atoms of Representation in Natural Language
Processing*, aims to collect such findings and insights, to encourage
diving deep into the relationships between language and computation, and to
foster holistic approaches and collaboration in development and assessment
of different aspects of representation in language models and other NLP
systems and applications.

Suggested themes and article types for submissions include:

   - Novel schemas for subword tokenization and for tokenizer application
   methodologies
   - Benchmarks and analyses of tokenizer effectiveness and quality,
   including crosslingual and multilingual setups, morphological aspects,
   information-theoretic constructions, correlation with quality of learned
   embeddings and downstream model performance, ability to handle linguistic
   phenomena, security implications, societal implications, etc.
   - Development, modification, evaluation, and analysis of token-free
   representation schemata based on textual input
   - Development, modification, evaluation, and analysis of token-free
   representation schemata utilizing multimodal input such as visual, spatial,
   or acoustic signals; combination of different linguistic signals (auditory,
   textual, sign language) into a single input framework
   - Theoretical contributions addressing expressive power or limitations of
   various textual representation methodologies
   - Analysis of the textual modality and its representation on the
   computational level, e.g. of Unicode standards

The full call is available here:
https://www.mdpi.com/journal/applsci/special_issues/2SHP0751R0.
Several waiver discounts are available (contact me personally).

-- 
- Yuval Pinter
Guest Editor
www.yuvalpinter.com

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
