Hey all, Just to let you know that I had lunch with a UX researcher about the proposal. I have some thoughts and notes which I will type up this week.
Two people should be enough for a pilot to work out the kinks, and 30 people would work in an academic setting for the statistics we want. A Mechanical Turk study could be complementary, but it doesn't really replace one. The idea of a scientific study would be to vary one variable at a time (the fingerprint mechanism), whereas a Mechanical Turk study varies potentially thousands (lighting, time of day, tiredness, screen, instructions, etc.). It could be good to do the scientific study first to inform a larger Mechanical Turk study varying more variables. As for so many trials, the "learning effect" is a thing, and you can account for it statistically by varying the order the trials are presented in per subject (see the sketches at the end of this message).

C

On Thu, May 1, 2014 at 11:00 AM, Ben Laurie <[email protected]> wrote:
> On 1 May 2014 05:18, Joseph Bonneau <[email protected]> wrote:
>> [Starting a new thread on this]
>>
>> Sorry for being a week late, but I read over Tom's proposal at:
>> https://github.com/tomrittervg/crypto-usability-study
>>
>> Having spent much of the last two weeks reviewing papers for SOUPS (Symposium on Usable Privacy and Security) and discussing design flaws in many security usability studies, there are a few points I'm concerned about:
>>
>> *Assigning study participants to multiple treatments (e.g. having them test multiple methods of fingerprint comparison) introduces a number of issues. You have data points that are now correlated, so you can't just average performance for each treatment and compare. The statistics get vastly more complicated to do correctly and your statistical power goes down. More importantly, there are real external validity questions here. Users will learn and be more clever and alert after seeing multiple systems. Most users will only ever see one.
>>
>> *While not really stated, I'm imagining the same participant will be asked to perform multiple trials for each treatment. This also introduces some complexity when doing the statistical analysis, but is more manageable.
>>
>> *Also not really stated is the "base rate" of errors. If users are doing the experiment multiple times, we want the base rate to be extremely low to have reasonable validity. In reality, fewer than 1% of the fingerprints you ever compare are going to mismatch. People do an approximation of Bayesian reasoning, and if their prior probability is 0.99 that the fingerprints match, that is much different from an experiment where they're mismatching half the time and participants come to expect it. There are two ways around this: a deception study (the "head fake" approach, as Tom put it) or having users do the task many times with the vast majority of fingerprints matching.
>>
>> *This proposal includes 10 experimental treatments. That's a lot. If we don't re-use participants, we already need 10 people just to have one person try each experiment, and that's assuming the phone method doesn't take two people. If we ask them all to do many dummy trials with matching fingerprints to screen for errors, this becomes an utterly impractical study.
>>
>> Overall I think it will be nigh-impossible to do this study in person and have a sufficient sample size. I propose doing the study online using Amazon Mechanical Turk (mTurk). This is now standard for psychology experiments in general, and for security usability experiments as well. While not perfect, multiple studies have confirmed that this reaches a much more representative user population than in-person studies ever obtain.
>>
>> I would do the experiment as follows:
>> *For the phone comparison method, play an audio recording of somebody reading the fingerprint and display it on screen for comparison. This isn't perfect, as it's non-interactive, but it's a start.
>> *For the business card method, show a JPEG of a business card, and have them compare it to a version rendered in text.
>> *Assign each user to only 1 treatment and have many more total users. Because you pay users for their time, this is basically cost-neutral.
>> *Give each user 50-100 trials, and have perhaps 1-5 of them be incorrect. These would best be randomized, since occasionally users talk out of band about studies. Have a 50-50 mix of random and targeted errors.
>>
>> Now, for 10 treatments, we can aim for 100 users per treatment. We may want to adjust this based on some pilot studies. If the experiment takes 10 minutes, this is probably about $1 per user, so we're talking $1000. That's a lot, but an in-person user study would cost far more.
>
> I like this plan a lot. I am curious whether it's a good idea to give people so many trials - presumably for most people, verification is a relatively infrequent thing, and so allowing them to become familiar with it seems like it would bias the results.
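To make the counterbalancing point concrete, here is a minimal sketch in Python (the function names and treatment labels are made up for illustration; none of this is from Tom's proposal). It builds a cyclic Latin square over the treatments, so across subjects each treatment shows up in each position roughly equally often and the learning effect averages out rather than piling onto whichever method comes last:

def latin_square_orders(treatments):
    """One presentation order per row; across the rows, every treatment
    appears in every position exactly once (cyclic Latin square)."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]

def assign_orders(subject_ids, treatments):
    """Cycle subjects through the rows of the square, so treatment order
    is varied per subject in a balanced way."""
    orders = latin_square_orders(treatments)
    return {subject: orders[i % len(orders)] for i, subject in enumerate(subject_ids)}

if __name__ == "__main__":
    # Treatment labels here are placeholders, not the proposal's actual list.
    treatments = ["hex", "pgp-words", "phone-audio", "business-card"]
    for subject, order in assign_orders(["s01", "s02", "s03", "s04", "s05"], treatments).items():
        print(subject, order)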
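And a rough sketch of the per-user trial mix Joe describes above: many trials, a handful of mismatches at randomized positions, with a 50-50 split between random and targeted errors. The way a "targeted" error is constructed here (keeping the ends of the fingerprint intact and flipping a couple of characters in the middle) is an assumption for illustration, not something the proposal specifies:

import random

HEX = "0123456789abcdef"

def random_fingerprint(rng, n_chars=40):
    """A random 40-character hex fingerprint."""
    return "".join(rng.choice(HEX) for _ in range(n_chars))

def targeted_mismatch(fp, rng, n_flips=2):
    """Change a couple of characters away from the ends, so the parts
    people tend to check (start and end) still match."""
    chars = list(fp)
    for i in rng.sample(range(4, len(fp) - 4), n_flips):
        chars[i] = rng.choice([c for c in HEX if c != chars[i]])
    return "".join(chars)

def build_trials(n_trials=60, n_mismatches=3, seed=None):
    """Return (shown_fp, reference_fp, should_match) tuples: mostly matching
    pairs, with the few mismatches placed at random positions and split
    roughly 50-50 between random and targeted errors."""
    rng = random.Random(seed)
    mismatch_slots = set(rng.sample(range(n_trials), n_mismatches))
    trials = []
    for i in range(n_trials):
        fp = random_fingerprint(rng)
        if i in mismatch_slots:
            other = random_fingerprint(rng) if rng.random() < 0.5 else targeted_mismatch(fp, rng)
            trials.append((fp, other, False))
        else:
            trials.append((fp, fp, True))
    return trials

if __name__ == "__main__":
    for shown, reference, should_match in build_trials(seed=1)[:5]:
        print(should_match, shown, reference)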
