Dear Luca,
We have a project that deals with norm data, which we are trying to
standardize. It is not quite what you are asking for here, but you might
want to check it anyway:
https://norare.clld.org
Essentially, we assemble and try to standardize norm data across many
languages and across various concepts. The concepts are defined according
to another project, the Concepticon (https://concepticon.clld.org), which
provides a reference catalogue for elicitation glosses in comparative
linguistics.
Best,
Mattis
On 03.04.24 17:04, Luca Onnis via Corpora wrote:
I plan to construct a comprehensive word frequency list in many
languages, with one row per word and four columns: the orthographic word,
the phonemic transcription of the word in IPA, the word frequency, and
the language.
The words do not have to be translation equivalents, but could be the
top N thousand words for a given language.
The purpose is to then carry out cross-linguistic analyses of
distributional properties of phonemes and phonotactic sequences. Another
use could be to train grapheme-to-phoneme models for missing words in
the list for each language. Ideally, the final resource would be free to
use for research purposes.
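
A minimal sketch of how such a table could feed the distributional
analyses mentioned above. Everything in it is an assumption made for
illustration: the sample rows and their frequencies are invented, the
language codes are arbitrary labels, and the transcriptions are assumed
to be pre-segmented with spaces between phonemes (real IPA strings would
first need a segmentation step). The function names are hypothetical.

```python
from collections import Counter

# Hypothetical rows in the proposed four-column layout:
# (orthographic word, space-segmented IPA, frequency, language).
ROWS = [
    ("cat", "k æ t", 120, "eng"),
    ("tack", "t æ k", 30, "eng"),
    ("gatto", "ɡ a tː o", 80, "ita"),
]


def phoneme_counts(rows):
    """Return frequency-weighted phoneme counts per language."""
    counts = {}
    for _word, ipa, freq, lang in rows:
        counter = counts.setdefault(lang, Counter())
        for phoneme in ipa.split():
            counter[phoneme] += freq
    return counts


def bigram_counts(rows):
    """Return frequency-weighted counts of adjacent phoneme pairs per language."""
    counts = {}
    for _word, ipa, freq, lang in rows:
        counter = counts.setdefault(lang, Counter())
        phonemes = ipa.split()
        # Pair each phoneme with its successor within the same word.
        for pair in zip(phonemes, phonemes[1:]):
            counter[pair] += freq
    return counts


if __name__ == "__main__":
    for lang, counter in phoneme_counts(ROWS).items():
        print(lang, counter.most_common(3))
```

Weighting by word frequency gives token-level counts; adding 1 instead
of freq would give type-level counts.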
Technically, the word list is easy to compile, but one has to rely on
limited open-source data. While frequency info can be easily obtained
for many languages from open sources, phonemic transcriptions are
typically available in proprietary dictionaries. Wiktionary is a good
starting point, but its coverage might be limited for some languages,
and thus it can provide valuable data for only a handful of commonly
spoken languages. I would like to obtain a wider and more representative
sample of world languages.
I am thus inviting anyone with tips or resources available to contribute
to this compilation effort. Contributors would receive due
acknowledgment, and could specify under which conditions they want
their data to be used. You can reply here or write to [email protected].
Best,
Luca Onnis, PhD
Professor, University of Oslo, Norway