Hello!

I wrote a module that solves the following statistical problem:
We are given many objects with a feature that can be divided into several groups,
e.g. looking at the salaries of men and women, grouped by income,
let's say $0-$500, $500-$1000, $1000-$2000, $2000-$5000, $5000-$20_000, >$20_000.
We have a second feature, which we suppose to be highly correlated with the first
one.
In the example, let's say the rent of their apartment.
We suppose that the objects within one group are quite similar,
or, statistically speaking,
the variance of the first feature is low within every group.
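
A minimal sketch of this setup (not the module's actual code): it sorts
incomes into the brackets from the example and computes the variance within
each group. The names group_of and variance, the made-up incomes, and the
choice to treat each bracket as "lower bound included, upper bound excluded"
are assumptions for illustration only.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Upper bounds of the income brackets from the example;
# the last group (> $20_000) is open-ended.
my @upper = (500, 1000, 2000, 5000, 20_000);

# Index of the bracket an income falls into (0 .. scalar @upper).
# Assumption: a bracket includes its lower bound and excludes its upper bound.
sub group_of {
    my ($income) = @_;
    for my $i (0 .. $#upper) {
        return $i if $income < $upper[$i];
    }
    return scalar @upper;
}

# Population variance of a list of numbers (0 for an empty list).
sub variance {
    my @v = @_;
    my $n = @v;
    return 0 unless $n;
    my $mean = sum(@v) / $n;
    return sum(map { ($_ - $mean) ** 2 } @v) / $n;
}

# Made-up incomes, purely for illustration.
my @incomes = (300, 450, 800, 900, 1500, 1700, 3000, 4000, 25_000);

# Collect the incomes of each group.
my @groups;
push @{ $groups[ group_of($_) ] }, $_ for @incomes;

for my $i (0 .. $#groups) {
    my @members = @{ $groups[$i] // [] };   # a group may be empty
    printf "group %d: n=%d variance=%.1f\n",
        $i, scalar @members, variance(@members);
}
```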

We choose the first feature as one that is simple to obtain,
and we suppose that the second is expensive to obtain.
Think of a questionnaire
where we can only afford to ask 10_000 of all these people
about their rent.

Then our aim is to reduce the variance of the estimate for the correlated second feature.
There's a simple mathematical answer for how to achieve this:
Calculate the variance of the first feature for each group.
Now take $x[$i] objects from the $i-th group,
where $x[$i] is proportional to the variance and the size of the group
(strictly, the textbook optimal allocation uses the standard deviation,
i.e. the square root of the variance).
The main implementation problem is to take care
of the finiteness corrections, as the proportional shares are floating-point numbers,
while we need integers whose sum over @x is exactly the wanted number (e.g. 10_000).
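
A rough sketch of how such an allocation could look, under stated
assumptions; it is not the code of the module described here. The function
name allocate and its interface are hypothetical. Each group's share is
proportional to size times "spread"; whether the caller passes the variance
or the standard deviation as the spread is left open. Rounding uses the
largest-remainder method, and the fallback to purely size-proportional
shares when all spreads are 0 is likewise an assumption.

```perl
use strict;
use warnings;
use List::Util qw(sum);
use POSIX qw(floor);

# allocate($total, \@sizes, \@spreads) returns the list @x of integers with
# sum(@x) == $total and $x[$i] <= $sizes[$i].
sub allocate {
    my ($total, $sizes, $spreads) = @_;
    die "cannot draw more objects than exist\n"
        if $total > sum(0, @$sizes);

    my @x    = (0) x scalar @$sizes;   # samples per group
    my @open = (0 .. $#$sizes);        # groups not yet capped at their size
    my $left = $total;                 # samples still to distribute
    my %share;

    # Step 1: cap groups whose proportional share exceeds their size,
    # then redistribute the rest among the remaining groups.
    while (1) {
        my %w    = map { $_ => $sizes->[$_] * $spreads->[$_] } @open;
        my $wsum = sum(0, values %w);
        if ($wsum == 0) {              # assumed fallback: allocate by size only
            %w    = map { $_ => $sizes->[$_] } @open;
            $wsum = sum(0, values %w);
        }
        if ($wsum == 0) {              # nothing left to draw from
            %share = map { $_ => 0 } @open;
            last;
        }
        %share = map { $_ => $left * $w{$_} / $wsum } @open;
        my @full = grep { $share{$_} > $sizes->[$_] } @open;
        last unless @full;
        for my $i (@full) {
            $x[$i] = $sizes->[$i];
            $left -= $sizes->[$i];
        }
        @open = grep { $share{$_} <= $sizes->[$_] } @open;
    }

    # Step 2: round down, then hand out the missing units one by one,
    # largest fractional part first (ties broken by group index).
    $x[$_] = floor($share{$_}) for @open;
    my $missing = $total - sum(0, @x);
    my @order = sort {
        ($share{$b} - floor($share{$b})) <=> ($share{$a} - floor($share{$a}))
            || $a <=> $b
    } @open;
    while ($missing > 0) {
        my $given = 0;
        for my $i (@order) {
            next if $x[$i] >= $sizes->[$i];   # never exceed the group size
            $x[$i]++;
            $missing--;
            $given++;
            last if $missing == 0;
        }
        last unless $given;
    }
    return @x;
}

# Example with made-up group sizes and spreads:
print join(", ", allocate(100, [1000, 500, 100], [10, 20, 200])), "\n";
# prints 25, 25, 50
```

The capping step matters when a small group has a very large spread: its
proportional share can exceed the number of objects it contains, so the
surplus has to be passed on to the other groups before rounding.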

In German it's called "Optimale Schichtung bei geschichteten Zufallsproben"
(literally, "optimal stratification for stratified random samples").
As I'm not a native English speaker,
I don't know the established English term.
A module name could be
Statistics::PerfectDistribution
or
Statistics::GroupedSamples

but I'm not very happy with either of them.
I hope someone has a better idea.


Thanks in advance,
Janek Schleicher
