Thank you, Reginald and Michael, for paying attention to my problem.

First of all, since I'm replying to the list after some private conversation with Reginald (very interesting, by the way), I'll first clarify the motivation and then strip away the anecdotal information, leaving the bare mathematical problem so that it is easier to cast into an adequate conceptual framework. If that is too boring, skip ahead to the formal statement below.

My problem is to get lists of phonetically balanced words for use in intelligibility tests (whether in an audiological context, an architectural-acoustics one, or a communication-receiver one). Phonetically balanced means that the phonemes appear in the list of words with approximately the same probability as they appear in general language usage. This way, the test exposes the subject or patient to a situation similar to natural-language speech with far fewer utterances.

A few observations are in order. First, the probabilities of the phonemes are drawn from statistics over some corpus. Second, the dictionary from which the words of my target lists are taken may or may not be the same set of words used in the corpus. For instance, I may want to include all the words of a general dictionary as potential members of the lists, or for various reasons I may wish to limit it to a subset: say, the 2000 most frequently used words, the words with two syllables, or the words used in a given context (a local community, a professional specialty).

The process that converts words into groups of phonemes, called phonemic transcription, is not straightforward, but for my purposes we can assume it has been performed previously. The process of getting the statistics of appearance in the corpus is really straightforward, and we can assume it has already been performed as well. Notice that the probabilities might also be imposed arbitrarily (for instance, in an experiment one might want to exaggerate the probability of certain particular phonemes).

Let's call the phonemes "symbols"; the set of all phonemes, "alphabet"; any (typically short) sequence of symbols, "word"; and the set from which the words that form the target list are taken, "dictionary". Let p = [p(1), ..., p(n)] be the vector of probabilities corresponding to the vector of symbols [S(1), ..., S(n)].

Then the problem can be stated as follows:
Given an alphabet of n symbols S(1), ..., S(n) and a dictionary D containing N words of variable length, generate a list L of M words such that the probability of finding symbol S(k) in the list matches some given probabilities p(k) as well as possible or, symbolically,

SUM( |P(s = S(k) / s belongs to L) - p(k)| )  =  min { SUM( |P(s = S(k) / s belongs to Li) - p(k)| ) }

where P is the probability and the minimum is taken over all possible M-word lists Li that can be drawn from D. The SUM operator is over k = 1, ..., n, and could be replaced by the sum of squares or any other suitable metric. Note that "s belongs to L" is an abuse of language, short for "s belongs to a word belonging to L".

Typically the words are restricted, for example, to disyllables. This is transparent to the problem, since the dictionary can be cropped to reflect such a restriction. Of course, the symbols may and will repeat.

Note also that the words in the list shouldn't be repeated: the M words must all be different.
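
If it helps to make this concrete, here is a minimal GMPL (MathProg) sketch of one possible formulation. All set and parameter names are placeholders, and the phoneme counts per word are assumed precomputed as data. To keep the problem linear, the sketch minimizes SUM |f(k) - p(k)*T|, where f(k) is the number of occurrences of S(k) in L and T is the total number of phonemes in L; this equals T times the objective above, so it is exact when all words have the same length (e.g., a dictionary cropped to m-phoneme words) and a close approximation otherwise:

set PHONEMES;                         # the alphabet S(1), ..., S(n)
set WORDS;                            # the dictionary D (already cropped, if desired)
param p{PHONEMES} >= 0;               # target probabilities p(k)
param count{WORDS, PHONEMES} >= 0, default 0;   # occurrences of phoneme k in word w
param len{w in WORDS} := sum{k in PHONEMES} count[w,k];   # word length in phonemes
param M > 0, integer;                 # size of the list L

var x{WORDS}, binary;                 # x[w] = 1 iff word w enters L (so no repetitions)
var dev{PHONEMES} >= 0;               # absolute deviation for phoneme k

s.t. list_size: sum{w in WORDS} x[w] = M;

# dev[k] >= | occurrences of S(k) in L - p[k] * total phonemes in L |
s.t. dev_pos{k in PHONEMES}:
   sum{w in WORDS} (count[w,k] - p[k]*len[w]) * x[w] <= dev[k];
s.t. dev_neg{k in PHONEMES}:
   sum{w in WORDS} (p[k]*len[w] - count[w,k]) * x[w] <= dev[k];

minimize total_dev: sum{k in PHONEMES} dev[k];

solve;
printf{w in WORDS: x[w] > 0.5} "%s\n", w;
end;

The binary x[w] enforces the no-repetition requirement directly, and the two inequalities per phoneme are the usual linearization of the absolute value. A sum-of-squares metric would make the objective quadratic, which GLPK does not handle, so the L1-style metric seems the natural fit here.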

Hope this clarifies the problem.

Best regards,

Federico Miyara



On 19/5/2025 23:59, Reginald Beardsley wrote:

FWIW, below is my attempt at disambiguating the problem. I have *some* confidence it is an accurate description, but assert no more. I've not had a reply from Federico yet, but expect one soon.

Translating a word problem into a mathematical problem is fraught with peril. Much of my career was spent helping people convert their word problem into a mathematical problem. Most of the time they immediately knew how to solve the problem after I asked some questions and gave them the proper mathematical formulation. My English lit BA degree had broad application to PhD level mathematics. Go figure. I'd never expected that.

From a series of emails with Federico, I *think* this is an accurate description:

Given phoneme frequencies for all of a language, e.g. Spanish, how does one select the best combination of repetitions of a subset of the entire lexicon to best match the phoneme distribution of the entire language? The goal is to evaluate verbal intelligibility in speech communication. It looks to me to be a straightforward linear error-minimization problem.

This seems to me a classic sparse L1 program, as described in the early-2000s paper by Emmanuel Candès called "The Dantzig Selector", an application of sparse L1 pursuits. I am acutely interested in whether that is correct, not merely for the solution of Federico's problem, but for my own understanding of sparse L1 pursuits as I learned them from Foucart and Rauhut's "A Mathematical Introduction to Compressive Sensing". Having worked for 3 years with almost no one with whom to converse, I'm less than confident I have all the nuances correct. If I am wrong, I should very much appreciate an explanation. I no longer have the pleasure of doing this as an occupation, but money was never what motivated me.

My perception is that the solution is straightforward, but tedious to implement because of size. The obvious solution to me is to write a program to generate a GMPL file. I *think* it is a mixed-integer problem, but I'm not yet convinced that's the best formulation.
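
For instance, the generated GMPL file could be just a data section feeding a fixed model such as the sketch given above in the thread. A toy example of what the generator might emit, with made-up words, counts, and target probabilities (the transcriptions are illustrative only):

data;

set PHONEMES := a e k n o s;
set WORDS := casa sano seno;

param M := 2;

param p := a 0.25  e 0.10  k 0.10  n 0.15  o 0.20  s 0.20;

param count :  a e k n o s :=
   casa        2 0 1 0 0 1
   sano        1 0 0 1 1 1
   seno        0 1 0 1 1 1 ;

end;

With tens of thousands of words this data section gets large, but it is entirely mechanical to produce from the transcribed dictionary and the corpus statistics.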

I am very grateful for the assistance provided by this list 8-10 years ago and should very much enjoy contributing something useful. GLPK is a true "tour de force". Many thanks to Andrew et al.

Have Fun!
Reg
----- Forwarded Message -----
From: Reginald Beardsley <[email protected]>
To: Federico Miyara <[email protected]>
Sent: Monday, May 19, 2025 at 12:10:37 PM CDT
Subject: Re: Question

Federico,

Is this a correct statement of your problem?

You have a dictionary of N words composed of various elements from a set of M symbols, each of which has a certain number of occurrences in the N words. The number of symbols from M which form a word in the set N varies, but is small.

You wish to determine the number of recurrences of a smaller subset of P words which has the same proportion of the M symbols as the entire set of N words, but with P << N. Further, you wish to be able to select different subsets of P words that all best match the frequency of occurrence of the M symbols in N, with Q selections from P. The particular sets of P words in each case are chosen independently, based on other criteria unrelated to the frequency of occurrence of the M symbols in N.

The desired output is a list of the number of occurrences of each word in the set P which best approximates the number of occurrences of the M symbols in the set N, for Q selections from the list P, in the case Q >= P.

The fundamental problem is then constructing Ax = y, where y is the vector of probabilities of each element of M in N, and x is an integer-valued vector of repetition counts of the words in P. The hard part is creating the correct A matrix.
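
In GMPL terms, that Ax = y reading might look like the sketch below. Since an exact solution will rarely exist, the equation is relaxed to minimizing |Ax - y| elementwise. Set and parameter names are placeholders, Q is the total number of utterances, and the phoneme-per-word count matrix is assumed given as data:

set PHONEMES;
set WORDS;
param p{PHONEMES} >= 0;
param count{WORDS, PHONEMES} >= 0, default 0;
param len{w in WORDS} := sum{k in PHONEMES} count[w,k];
param Q > 0, integer;                 # total number of utterances (the Q above)

var x{WORDS} >= 0, integer;           # repetition count of each word (the x above)
var dev{PHONEMES} >= 0;               # elementwise slack for |Ax - y|

s.t. total: sum{w in WORDS} x[w] = Q;

# row k of Ax = y, relaxed to -dev[k] <= (Ax - y)_k <= dev[k]
s.t. dev_pos{k in PHONEMES}:
   sum{w in WORDS} (count[w,k] - p[k]*len[w]) * x[w] <= dev[k];
s.t. dev_neg{k in PHONEMES}:
   sum{w in WORDS} (p[k]*len[w] - count[w,k]) * x[w] <= dev[k];

minimize total_dev: sum{k in PHONEMES} dev[k];

solve;
printf{w in WORDS: x[w] > 0.5} "%-12s %3d\n", w, x[w];
end;

Restricting x to be binary instead of integer recovers the no-repetition variant Federico describes in his reply above.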

Have Fun!
Reg
On Friday, May 16, 2025 at 04:55:34 PM CDT, Federico Miyara <[email protected]> wrote:



I need to solve the following problem:

I have an alphabet of n symbols and a dictionary with N words of m symbols each (n on the order of tens, N on the order of tens of thousands, m = 4, say).

Assuming each symbol has a definite probability, I need to generate a list of M words (M on the order of 100) taken from the dictionary in which the proportion of each symbol matches the required probability as closely as possible.

Is this a problem that can be solved using GLPK?

Thanks.

Best regards,

Federico Miyara




