Hello!

I wrote a module that solves the following statistical problem. We are given many objects with a feature that can be divided into several groups, e.g. the income of men and women, divided into the brackets $0-$500, $500-$1000, $1000-$2000, $2000-$5000, $5000-$20_000 and >$20_000. We have a second feature that we suppose to be highly correlated with the first one, in this example the rent of their apartment. We suppose that the objects within one group are quite similar or, statistically speaking, that the variance of the first feature is low in every group.
The first feature is chosen to be one that is cheap to get, while we suppose the second is expensive to get. Think of a questionnaire where we can only afford to ask 10_000 people out of all of them for their rent. Our aim is then to make the variance of the estimate for the correlated second feature as small as possible.

There is a simple mathematical answer for how to achieve this: calculate the variance of the first feature for each group, then take $x[$i] objects from the $i-th group, where $x[$i] is proportional to the size of the group times the standard deviation (the square root of the variance) of the first feature in that group. The main problem in the implementation is taking care of the finiteness corrections: the proportional rule returns floating-point numbers, while we need integers, and the sum of @x has to be exactly the wanted number (e.g. 10_000).

In German it's called "Optimale Schichtung bei geschichteten Zufallsproben" (literally, "optimal stratification for stratified random samples"). As I'm not a native English speaker, I don't know the established English term. A module name could be Statistics::PerfectDistribution or Statistics::GroupedSamples, but I'm not very happy with either. I hope someone has a better idea.

Thanks in advance,
Janek Schleicher
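
To make the rounding problem concrete, here is a minimal sketch in Perl of the allocation step described in this post. It is not the module's actual code: the function name allocate_sample and its three arguments are made up for illustration. It assumes that the weight of a group is its size times the standard deviation of the first feature, that every weight is positive, and it uses largest-remainder rounding as one possible way to get integers that sum exactly to the wanted number. Groups whose share would exceed their size are taken completely, and the rest is redistributed among the others.

```perl
use strict;
use warnings;
use List::Util qw(sum);
use POSIX qw(floor);

# Hypothetical helper, for illustration only (not the module's API).
#   $sizes  : array ref, number of objects in each group
#   $spread : array ref, standard deviation of the first feature per group
#   $total  : wanted sample size, e.g. 10_000
# Returns a list of integers, one per group, that sums exactly to $total.
sub allocate_sample {
    my ($sizes, $spread, $total) = @_;
    die "need one spread value per group\n" unless @$sizes == @$spread;
    die "cannot take $total objects, there are too few\n"
        if $total > sum(0, @$sizes);

    # Assumption: weight = group size * standard deviation, all positive.
    my @weight = map { $sizes->[$_] * $spread->[$_] } 0 .. $#$sizes;
    die "every group needs a positive weight\n" if grep { $_ <= 0 } @weight;

    # Step 1: fractional shares; a group can never give more than it has.
    my (%fixed, @share);
    while (1) {
        my @open = grep { !$fixed{$_} } 0 .. $#weight;
        my $left = $total - sum(0, map { $sizes->[$_] } keys %fixed);
        my $wsum = sum(0, @weight[@open]);
        @share = map {
            $fixed{$_} ? $sizes->[$_] : $left * $weight[$_] / $wsum
        } 0 .. $#weight;
        my @over = grep { $share[$_] > $sizes->[$_] } @open;
        last unless @over;
        $fixed{$_} = 1 for @over;    # take these groups completely
    }

    # Step 2: largest-remainder rounding, so that the sum is exact.
    my @x       = map { floor($_) } @share;
    my $missing = $total - sum(0, @x);
    my @order   = sort {
        ($share[$b] - $x[$b]) <=> ($share[$a] - $x[$a]) || $a <=> $b
    } 0 .. $#share;
    $x[$_]++ for @order[0 .. $missing - 1];

    return @x;
}

# Three groups, 1000 objects wanted.
my @x = allocate_sample([5000, 3000, 2000], [10, 20, 50], 1000);
print "@x\n";    # prints "238 286 476"
```

The fractional shares in the example are about 238.1, 285.7 and 476.2. Rounding down gives 999, so the one missing object goes to the second group, which has the largest remainder.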
