Suppose there are N people with a given full name ('Stefan Magdalinski', for example).
There are L registered lobbyists and V White House visitors.
(The population of America is P, so any one person has an L/P chance of being a lobbyist
and a V/P chance of being a visitor, unless there's a way of reducing this?)

Assuming that lobbyists and visitors are independent (i.e. there is no
true correlation), the probability that, for a given name, both a
lobbyist and a visitor exist is given by:
p(lobbyist in N) * p(visitor in N)

p(visitor in N) = 1- p(NOT visitor in N) = 1- (1-(V/P))^N
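
A rough sketch of that formula in Python, in case anyone wants to play
with the numbers. The function name p_visitor_match and its argument
names are made up for illustration, and it assumes each of the N people
is independently a visitor with probability V/P:

```python
def p_visitor_match(n, v, p):
    """Chance that at least one of n same-named people is a visitor.

    n: number of people sharing the name (N)
    v: number of visitors (V)
    p: total population (P)
    Assumes each person is a visitor independently with probability v/p.
    """
    per_person = v / p
    # p(NOT visitor in N) = (1 - V/P)^N, so take the complement.
    return 1 - (1 - per_person) ** n


if __name__ == "__main__":
    # Illustrative call with round numbers, not real data.
    print(p_visitor_match(n=1000, v=1000, p=300e6))
```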

However, we actually *know* whether a lobbyist with this name exists, so
p(lobbyist in N) isn't really a probability: we can just set it to 0 or 1
(1 whenever we care, i.e. when we have a lobbyist). Or should we do this
for the visitor instead? Largest number? Someone who knows more stats
than me should probably do this.

So for John Adams, N=8568
http://futureboy.homeip.net/fsp/namefreq.fsp?firstName=john&lastName=adams&pop=300+million
P=300e6, V is guessed at 1000.

1-(1-(1000/300e6))^8568 = 0.028, i.e. there is about a 3% chance that
this match would occur by chance alone.
For V=10000 this goes up to about 25%; for V=100 it goes down to 0.3%.
For N=80000 (with V=1000) it goes up to 23%; for N=800 it goes down to 0.3%.
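
To check the figures above without a calculator, here is a small Python
sketch that recomputes them. The names p_visitor_match and sweep are
just illustrative, and P=300e6 and the V values are the same guesses as
above, not real data. Note that while N*V/P is small the result is
roughly N*V/P, which is why changing N or V tenfold moves the answer by
about tenfold.

```python
def p_visitor_match(n, v, p=300e6):
    """1 - (1 - V/P)^N: chance at least one of n namesakes is a visitor."""
    return 1 - (1 - v / p) ** n


def sweep():
    """Recompute the John Adams cases: vary V at N=8568, then N at V=1000."""
    cases = [
        (8568, 100),
        (8568, 1000),
        (8568, 10000),
        (800, 1000),
        (80000, 1000),
    ]
    return [(n, v, p_visitor_match(n, v)) for n, v in cases]


if __name__ == "__main__":
    for n, v, prob in sweep():
        # Print each case as a percentage with one decimal place.
        print(f"N={n:>6}  V={v:>6}  p={prob:.1%}")
```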

Someone who actually remembers more stats than me, please check I'm
not barking up the wrong tree.

Dave, who doesn't trust the source he used for the name count: being
called John and being called Adams are STRONGLY CORRELATED.
