On 7/13/02 3:33 PM, "Nick Simicich" <[EMAIL PROTECTED]> wrote:
>>> If someone wants to vote manually a couple hundred times I do not care,
>>
>> But I do.
>
> Do you think that will skew the numbers that much?

Will it? Maybe, maybe not. Can it? Sure. All it takes is one script kiddie determined to make "reply-to coercion" win to make the numbers useless. So I need to design to protect myself from that possibility.

>>> I would think that if you simply recorded ip
>>> addresses (or even an MD5 of each octet) that would settle automated voting
>>> down.
>>
>> And screw over most users from AOL, and any other place that has a
>> significant number of addresses sharing a small IP range through timesharing
>> or firewalls and proxies. It doesn't work in the general case.
>
> My point was not to automatically throw those things out. It was to allow
> someone who was a third party to judge the reliability of the data, and to
> do some selection based on addresses and commonality of addresses while
> preserving the actual value of the addresses.

The problems here are legion. First, someone makes a subjective decision about which votes are valid, so you potentially add in all sorts of bias. Scientists who get caught choosing data to fit the graph tend to lose their grants, because that's, well, fraud. And whether or not this third party actually does that, the data will live under the suspicion that they might have.

Second, looking for, finding, and resolving these problems takes time and energy from someone. Right now, that someone is me. If I can design fraud out of the system in the first place, that's a far more effective use of my time and energy, because you get higher-quality, better-trusted data. Will I stop 100% of the fraud? Probably not. But if I can push it below the point of statistical significance, that's good enough, and a lot better than saying "we'll clean it up later". Dirty data is dirty data, and you run the risk of not cleaning it up, just moving the dirt around.
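For what it's worth, the per-octet MD5 scheme proposed above can be sketched in a few lines (Python here; the function name and sample addresses are illustrative, not from the thread). The sketch also makes the AOL objection concrete: every voter behind one shared proxy produces an identical hashed record, so they all look like a single repeat voter, while commonality analysis on the hashes still works without exposing the raw addresses.

```python
import hashlib

def hash_octets(ip):
    """Hash each octet of a dotted-quad IPv4 address separately,
    as proposed in the thread (a sketch, not an endorsement)."""
    return tuple(hashlib.md5(octet.encode()).hexdigest()
                 for octet in ip.split("."))

# Two voters behind the same proxy produce identical records:
a = hash_octets("205.188.116.10")
b = hash_octets("205.188.116.10")
# A neighbor in the same /24 shares the first three hashes:
c = hash_octets("205.188.116.11")

assert a == b          # same address -> indistinguishable voters
assert a[:3] == c[:3]  # shared range is still visible in hashed form
```

Note that because each octet has only 256 possible values, these hashes are trivially reversible with a 256-entry lookup table, so the scheme obscures the addresses only superficially.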
If I open the data to users to evaluate, I want them all analyzing the same set of data. I don't want all 12 of them deciding which data ought to be excluded by their own idea of "fair", because you end up with data that will say whatever people want it to say, and with analyses that can't be compared to each other even though the data source is the same. That makes it basically useless.

--
Chuq Von Rospach, Architech
[EMAIL PROTECTED] -- http://www.chuqui.com/

Very funny, Scotty. Now beam my clothes down here, will you?
