Re: random.sample with large weighted sample-sets?

duncan smith Sun, 16 Feb 2014 08:06:54 -0800

On 16/02/14 05:08, Ben Finney wrote:

Tim Chase <python.l...@tim.thechases.com> writes:

I'm not coming up with the right keywords to find what I'm hunting.
I'd like to randomly sample a modestly compact list with weighted
distributions, so I might have

   data = (
     ("apple", 20),
     ("orange", 50),
     ("grape", 30),
     )


That's not a list, it's a tuple. I think you want a list.

When you want a sequence where each position has a semantic meaning, use
a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
the position it's in; if the items were in a different order, they'd
mean different things.

When you want a sequence where the positions don't have a special
meaning – each item means exactly the same no matter if you change the
order – that's sometimes called a “homogeneous” sequence, and you want a
list.

So a “record” should be represented as a tuple, and a “table” of records
should be represented as a list of tuples:

     records = [
             ("apple", 20),
             ("orange", 50),
             ("grape", 30),
             ]

and I'd like to random.sample() it as if it was a 100-element list.


[snip]

That's a description of sampling without replacement. The probabilitieschange as items are sampled. e.g. The probability of the first itembeing "apple"is 20/100. But the probability that the second sampled itemis "apple" is either 19/99 or 20/99, depending on the value of the firstsampled item. The following (due to Knuth) will generate indices into anotional list of items.



def indices(n, pop):
    # generates indices into a
    # population list containing
    # items with frequencies in pop
    # [("apple", 10), ("orange", 50), ...]
    N = sum(tup[1] for tup in pop)
    i = m = 0
    while m < n:
        u = random.random()
        if (N-i)*u >= n-m:
            i += 1
        else:
            yield i
            i += 1
            m += 1


>>> list(indices(3, [("apple", 20),("orange", 50),("grape", 30)]))
[8, 27, 78]
>>>

The indices are generated in order, so it could easily be extended togenerate items or item count pairs.

There might be something more efficient based on the hypergeometricdistribution (generate a number of apples, then a number of orangesgiven the number of sampled apples, then a number of grapes given thenumber of sampled apples and oranges, etc.).


Duncan
--
https://mail.python.org/mailman/listinfo/python-list

Re: random.sample with large weighted sample-sets?

Reply via email to