On 16/02/14 05:08, Ben Finney wrote:
Tim Chase <python.l...@tim.thechases.com> writes:

I'm not coming up with the right keywords to find what I'm hunting.
I'd like to randomly sample a modestly compact list with weighted
distributions, so I might have

   data = (
     ("apple", 20),
     ("orange", 50),
     ("grape", 30),
     )

That's not a list, it's a tuple. I think you want a list.

When you want a sequence where each position has a semantic meaning, use
a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
the position it's in; if the items were in a different order, they'd
mean different things.

When you want a sequence where the positions don't have a special
meaning – each item means exactly the same no matter if you change the
order – that's sometimes called a “homogeneous” sequence, and you want a
list.

So a “record” should be represented as a tuple, and a “table” of records
should be represented as a list of tuples:

     records = [
             ("apple", 20),
             ("orange", 50),
             ("grape", 30),
             ]

and I'd like to random.sample() it as if it was a 100-element list.


[snip]

That's a description of sampling without replacement. The probabilities change as items are sampled. e.g. The probability of the first item being "apple"is 20/100. But the probability that the second sampled item is "apple" is either 19/99 or 20/99, depending on the value of the first sampled item. The following (due to Knuth) will generate indices into a notional list of items.


def indices(n, pop):
    # generates indices into a
    # population list containing
    # items with frequencies in pop
    # [("apple", 10), ("orange", 50), ...]
    N = sum(tup[1] for tup in pop)
    i = m = 0
    while m < n:
        u = random.random()
        if (N-i)*u >= n-m:
            i += 1
        else:
            yield i
            i += 1
            m += 1


>>> list(indices(3, [("apple", 20),("orange", 50),("grape", 30)]))
[8, 27, 78]
>>>


The indices are generated in order, so it could easily be extended to generate items or item count pairs.

There might be something more efficient based on the hypergeometric distribution (generate a number of apples, then a number of oranges given the number of sampled apples, then a number of grapes given the number of sampled apples and oranges, etc.).

Duncan
--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to