Thank you, Reginald and Michael, for paying attention to my problem.

First of all, since I'm replying to the list after some private conversation with Reginald (very interesting, by the way), I'll first clarify the motivation and then strip away the anecdotal information, leaving the bare mathematical problem so that it is easier to cast into an adequate conceptual framework. If that is too boring, skip ahead to the formal statement below.

My problem is to get lists of phonetically balanced words for use in intelligibility tests (whether in an audiological context, an architectural-acoustics one, or a communication-receiver one). Phonetically balanced means that the phonemes appear in the list of words with approximately the same probability as they appear in general language usage. This way, the test exposes the subject or patient to a situation similar to natural-language speech with far fewer utterances.

A few observations are in order. First, the probabilities of the phonemes are drawn from statistics over some corpus. Second, the dictionary from which the words of my target lists are taken may or may not be the same set of words used in the corpus. For instance, I may want to include all the words of a general dictionary as potential members of the lists, or for various reasons I may wish to limit it to a subset: say, the 2000 most frequently used words, the words with two syllables, or the words used in a given context (a local community, a professional specialty).

The process that converts words into groups of phonemes, called phonemic transcription, is not straightforward, but for my purposes we can assume it has been performed previously. The process of getting the statistics of appearance in the corpus is really straightforward, and we can assume it has already been performed as well. Notice that the probabilities might also be imposed arbitrarily (for instance, in an experiment one might want to exaggerate the probability of certain particular phonemes).

Let's call the phonemes "symbols"; the set of all phonemes, "alphabet"; any (typically short) sequence of symbols, "word"; and the set from which the words that form the target list are taken, "dictionary". Let p = [p(1), ..., p(n)] be the vector of probabilities corresponding to the vector of symbols [S(1), ..., S(n)].

Then the problem can be stated as follows:
Given an alphabet of n symbols S(1), ..., S(n) and a dictionary D containing N words of variable length, generate a list L of M words such that the probability of finding symbol S(k) in the list matches some given probabilities p(k) as well as possible or, symbolically,

SUM( |P(s = S(k) / s belongs to L) - p(k)| )  =  min { SUM( |P(s = S(k) / s belongs to Li) - p(k)| ) }

where P is the probability and the minimum is taken over all possible M-word lists Li that can be drawn from D. The SUM operator is over k = 1, ..., n, and could be replaced by the sum of squares or any other suitable metric. Note that "s belongs to L" is an abuse of language, short for "s belongs to a word belonging to L".

Typically the words are restricted, for example, to disyllables. This is transparent to the problem, since the dictionary can be cropped to reflect such a restriction. Of course, the symbols may and will repeat.

Note also that the words in the list shouldn't be repeated: the M words must all be different.
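
If it helps to make this concrete, here is a minimal GMPL (MathProg) sketch of one possible formulation. All set and parameter names are placeholders, and the phoneme counts per word are assumed precomputed as data. To keep the problem linear, the sketch minimizes SUM |f(k) - p(k)*T|, where f(k) is the number of occurrences of S(k) in L and T is the total number of phonemes in L; this equals T times the objective above, so it is exact when all words have the same length (e.g., a dictionary cropped to m-phoneme words) and a close approximation otherwise:

set PHONEMES;                         # the alphabet S(1), ..., S(n)
set WORDS;                            # the dictionary D (already cropped, if desired)
param p{PHONEMES} >= 0;               # target probabilities p(k)
param count{WORDS, PHONEMES} >= 0, default 0;   # occurrences of phoneme k in word w
param len{w in WORDS} := sum{k in PHONEMES} count[w,k];   # word length in phonemes
param M > 0, integer;                 # size of the list L

var x{WORDS}, binary;                 # x[w] = 1 iff word w enters L (so no repetitions)
var dev{PHONEMES} >= 0;               # absolute deviation for phoneme k

s.t. list_size: sum{w in WORDS} x[w] = M;

# dev[k] >= | occurrences of S(k) in L - p[k] * total phonemes in L |
s.t. dev_pos{k in PHONEMES}:
   sum{w in WORDS} (count[w,k] - p[k]*len[w]) * x[w] <= dev[k];
s.t. dev_neg{k in PHONEMES}:
   sum{w in WORDS} (p[k]*len[w] - count[w,k]) * x[w] <= dev[k];

minimize total_dev: sum{k in PHONEMES} dev[k];

solve;
printf{w in WORDS: x[w] > 0.5} "%s\n", w;
end;

The binary x[w] enforces the no-repetition requirement directly, and the two inequalities per phoneme are the usual linearization of the absolute value. A sum-of-squares metric would make the objective quadratic, which GLPK does not handle, so the L1-style metric seems the natural fit here.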

Hope this clarifies the problem.

Best regards,

Federico Miyara



On 19/5/2025 23:59, Reginald Beardsley wrote:

FWIW, below is my attempt at disambiguating the problem. I have *some* confidence it is an accurate description, but assert no more. I've not had a reply from Federico yet, but expect one soon.

Translating a word problem into a mathematical problem is fraught with peril. Much of my career was spent helping people convert their word problem into a mathematical problem. Most of the time they immediately knew how to solve the problem after I asked some questions and gave them the proper mathematical formulation. My English lit BA degree had broad application to PhD level mathematics. Go figure. I'd never expected that.

From a series of emails with Federico, I *think* this is an accurate description:

Given phoneme frequencies for all of a language, e.g. Spanish, how does one select the best combination of repetitions of a subset of the entire lexicon to best match the phoneme distribution of the entire language? The goal is to evaluate verbal intelligibility in speech communication. It looks to me to be a straightforward linear error-minimization problem.

This seems to me a classic sparse L1 program, as described in the early-2000s paper by Emmanuel Candès called "The Dantzig Selector", an application of sparse L1 pursuits. I am acutely interested in whether that is correct, not merely for the solution of Federico's problem, but for my own understanding of sparse L1 pursuits as I learned them from Foucart and Rauhut's "A Mathematical Introduction to Compressive Sensing". Having worked for 3 years with almost no one with whom to converse, I'm less than confident I have all the nuances correct. If I am wrong, I should very much appreciate an explanation. I no longer have the pleasure of doing this as an occupation, but money was never what motivated me.

My perception is that the solution is straightforward, but tedious to implement because of size. The obvious solution to me is to write a program to generate a GMPL file. I *think* it is a mixed-integer problem, but I'm not yet convinced that's the best formulation.
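
For instance, the generated GMPL file could be just a data section feeding a fixed model such as the sketch given above in the thread. A toy example of what the generator might emit, with made-up words, counts, and target probabilities (the transcriptions are illustrative only):

data;

set PHONEMES := a e k n o s;
set WORDS := casa sano seno;

param M := 2;

param p := a 0.25  e 0.10  k 0.10  n 0.15  o 0.20  s 0.20;

param count :  a e k n o s :=
   casa        2 0 1 0 0 1
   sano        1 0 0 1 1 1
   seno        0 1 0 1 1 1 ;

end;

With tens of thousands of words this data section gets large, but it is entirely mechanical to produce from the transcribed dictionary and the corpus statistics.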

I am very grateful for the assistance provided by this list 8-10 years ago and should very much enjoy contributing something useful. GLPK is a true "tour de force". Many thanks to Andrew et al.

Have Fun!
Reg
----- Forwarded Message -----
From: Reginald Beardsley <[email protected]>
To: Federico Miyara <[email protected]>
Sent: Monday, May 19, 2025 at 12:10:37 PM CDT
Subject: Re: Question

Federico,

Is this a correct statement of your problem?

You have a dictionary of N words composed of various elements from a set of M symbols, each of which has a certain number of occurrences in the N words. The number of symbols from M which form a word in the set N varies, but is small.

You wish to determine the number of recurrences of a smaller subset of P words which has the same proportion of the M symbols as the entire set of N words, but with P << N. Further, you wish to be able to select different subsets of P words that all best match the frequency of occurrence of the M symbols in N, with Q selections from P. The particular sets of P words in each case are chosen independently, based on other criteria unrelated to the frequency of occurrence of the M symbols in N.

The desired output is a list of the number of occurrences of each word in the set P which best approximates the number of occurrences of the M symbols in the set N, for Q selections from the list P, in the case Q >= P.

The fundamental problem is then constructing Ax = y, where y is the vector of probabilities of each element of M in N, and x is an integer-valued vector of repetition counts of the words in P. The hard part is creating the correct A matrix.
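
In GMPL terms, that Ax = y reading might look like the sketch below. Since an exact solution will rarely exist, the equation is relaxed to minimizing |Ax - y| elementwise. Set and parameter names are placeholders, Q is the total number of utterances, and the phoneme-per-word count matrix is assumed given as data:

set PHONEMES;
set WORDS;
param p{PHONEMES} >= 0;
param count{WORDS, PHONEMES} >= 0, default 0;
param len{w in WORDS} := sum{k in PHONEMES} count[w,k];
param Q > 0, integer;                 # total number of utterances (the Q above)

var x{WORDS} >= 0, integer;           # repetition count of each word (the x above)
var dev{PHONEMES} >= 0;               # elementwise slack for |Ax - y|

s.t. total: sum{w in WORDS} x[w] = Q;

# row k of Ax = y, relaxed to -dev[k] <= (Ax - y)_k <= dev[k]
s.t. dev_pos{k in PHONEMES}:
   sum{w in WORDS} (count[w,k] - p[k]*len[w]) * x[w] <= dev[k];
s.t. dev_neg{k in PHONEMES}:
   sum{w in WORDS} (p[k]*len[w] - count[w,k]) * x[w] <= dev[k];

minimize total_dev: sum{k in PHONEMES} dev[k];

solve;
printf{w in WORDS: x[w] > 0.5} "%-12s %3d\n", w, x[w];
end;

Restricting x to be binary instead of integer recovers the no-repetition variant Federico describes in his reply above.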

Have Fun!
Reg
On Friday, May 16, 2025 at 04:55:34 PM CDT, Federico Miyara <[email protected]> wrote:



I need to solve the following problem:

I have an alphabet of n symbols and a dictionary with N words of m symbols each (n on the order of tens, N on the order of tens of thousands, m = 4, say).

Assuming each symbol has a definite probability, I need to generate a list of M words (M on the order of 100) taken from the dictionary in which the proportion of each symbol matches the required probability as closely as possible.

Is this a problem that can be solved using GLPK?

Thanks.

Best regards,

Federico Miyara




