On 11 March 2010 14:21, 'Dragon' Dave McKee <[email protected]> wrote:
Hmmm, I'm tired after training this morning, so I haven't had time
to really think about this, but I'm not sure that your approach works.
> So there's N people with a given full name ('Stefan Magdalinski', for
> example).
... something we don't actually know - unless someone has the relevant
data - but maybe we can guess it.
> There's L registered lobbyists, and V whitehouse visitors.
Ah, but the key thing here is you don't know what V is. You have a
list of names, but you don't know which of them are distinct visitors
- that's part of what we want to be able to estimate.
What you actually want to know is the probability distribution of
visits by a particular lobbyist. E.g. suppose you know I am a registered
lobbyist and there are 11 "Francis Davey" log entries. Call the number
of times I visited A. You know that:
P(A<0) = P(A>11) = 0
what you want to work out is what the distribution of A is *given* the
data you have. How many of those visits are me?
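
To make that concrete, here is a rough Python sketch of one toy model for
the distribution of A. Everything in it is an assumption for illustration,
not something we know: that each log entry is made independently by one of
n_namesakes people sharing the name, and that the lobbyist is
lobbyist_weight times as likely as any other namesake to be the visitor.
The names and example numbers are hypothetical.

```python
from math import comb


def visit_distribution(entries, n_namesakes, lobbyist_weight=1.0):
    """Return [P(A = 0), ..., P(A = entries)] under a toy binomial model.

    Assumptions (illustrative only): each log entry is made independently
    by one of n_namesakes people with the same name, and the lobbyist is
    lobbyist_weight times as likely as any other namesake to have made it.
    """
    # Chance that any single log entry belongs to the lobbyist.
    q = lobbyist_weight / (lobbyist_weight + n_namesakes - 1)
    # Binomial probability of exactly a of the entries being the lobbyist.
    return [
        comb(entries, a) * q**a * (1 - q) ** (entries - a)
        for a in range(entries + 1)
    ]


# Made-up inputs: 11 log entries, 3 namesakes, lobbyist 5x as likely to visit.
dist = visit_distribution(entries=11, n_namesakes=3, lobbyist_weight=5.0)
for a, p in enumerate(dist):
    print(f"P(A = {a:2d}) = {p:.4f}")
```

The list only covers A = 0 to 11, which matches P(A<0) = P(A>11) = 0. The
shape of the distribution depends entirely on the guesses for n_namesakes
and lobbyist_weight, which is exactly the data we don't have.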
> (the population of america is P, so there's a L/P chance of being a lobbyist,
Agreed - take a random person in the US, their chance of being a
lobbyist is close to L/P.
> and a V/P chance of being a visitor, unless there's a way of reducing this?)
Sadly *this* we can't say, since we don't know how many distinct
people visited (and anyway there are problems with assuming that a
person who isn't a lobbyist has the same chance of being a visitor as
someone who is - this will certainly be false).
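
A small Python sketch of why that assumption matters, using Bayes' rule.
All the numbers are made up for illustration, and the function name is
hypothetical; the only point is how much the answer moves once lobbyists
and non-lobbyists are allowed different chances of visiting.

```python
def p_lobbyist_given_visitor(p_lobbyist, p_visit_if_lobbyist, p_visit_if_not):
    """Bayes' rule: P(lobbyist | visitor) from the two conditional rates."""
    joint_lobbyist = p_visit_if_lobbyist * p_lobbyist
    joint_other = p_visit_if_not * (1.0 - p_lobbyist)
    return joint_lobbyist / (joint_lobbyist + joint_other)


p_lobbyist = 1e-4  # stands in for L/P; a made-up value

# Same visiting rate for everyone: being a visitor tells us nothing,
# so we just get L/P back.
same = p_lobbyist_given_visitor(p_lobbyist, 1e-3, 1e-3)

# Lobbyists 100x more likely to visit (a made-up ratio).
skewed = p_lobbyist_given_visitor(p_lobbyist, 1e-1, 1e-3)

print(f"equal rates:   {same:.6f}")
print(f"unequal rates: {skewed:.6f}")
```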
.... and I'm afraid it all falls down from there.
I can't see the list of lobbyists' names though.
--
Francis Davey
_______________________________________________
Mailing list [email protected]
Archive, settings, or unsubscribe:
https://secure.mysociety.org/admin/lists/mailman/listinfo/developers-public