Milton has reminded me that I neglected to send this... most people
will probably wish to ignore this message.

On Sat, Dec 19, 1998 at 01:03:25PM -0500, Milton Mueller wrote:
> Kent's posts on this topic are becoming increasingly shrill and
> personal. Let see if we can keep this rational and focused.

I'm sorry if this comes across as a personal attack. The unavoidable
fact is that I don't think it is a very good paper, and even under the
best of circumstances someone would likely take such a criticism
personally. Given that in this case the criticism concerns a possible
lack of basic competence in an area important to your career, it is
completely understandable that you would take it very personally.

But the fact remains -- it isn't a very good paper. Independent of
whether clever statistical techniques could have helped, the paper
draws conclusions its data cannot substantiate, and, more to the
point, it exhibits no awareness or understanding of its own flaws.
Your replies on this topic continue to exhibit the same lack of
awareness.

> Kent wants to argue that the utilization of straightforward
> statistical sampling techniques could have provided a study of
> domain name trademark conflicts with a rigorously representative
> sample of the entire population of all such disputes.

Not quite what I was arguing. See below.

> He bases
> this claim on distant memories of (presumably) old college
> statistics courses. He is not a social scientist and has no
> claim to expertise in this matter.

Interestingly enough, my first degree was in Psychology, from
Stanford, where good experimental design and getting the most
information you can from your data (analysis of variance, factor
analysis, etc.) were required areas of study (at least back then).
Later I earned degrees in math and computer science, which included
further exposure to probability and statistics from a more
mathematically rigorous point of view.
It is true, however, that these days I am more likely to encounter the
subject in discussions of Monte Carlo hydrodynamics simulations than
in any social science context.

> The premise is false. It is neither simple nor straightforward
> to get an unbiased sample of all domain name trademark cases on
> a global basis. The problem has nothing to do with the
> availability or non-availability of mathematical techniques, but
> with the availability of data and resources.
>
> Begin at the beginning. If one wants to do a statistical study,
> what population does one sample? There are basically two
> choices.
>
> 1. One can sample the population of *known* trademark domain
> name cases, cases about which there are court records and press
> accounts. If one takes this tack, one quickly discovers that
> there is no need to "sample" these cases. One can discover and
> count nearly all of them. And that in fact is what the Syracuse
> study did. As of May 1998, we turned up about 135 cases, 121 of
> which we had enough information to classify properly.
>
> With his limited knowledge of statistics, Kent wants to imply

Sigh. Bear in mind that there are very few who would be so bold as to
claim that their knowledge of *anything* was unlimited.

> that we could have used this information and some fancy
> calculations to estimate the statistical characteristics of the
> *entire population* of domain name trademark cases. But this is
> simply confusion on his part. No statistical calculation, no
> matter how good, can provide more information or certainty than
> a direct count of a population. Indeed, one only needs to employ
> sampling techniques when it is not practical to count the entire
> population.

The confusion here is yours, I'm afraid.
The problem is that you did indeed use sampling -- you just used it in
a statistically unconscious way.

The real population of interest, as you note, is the "entire
population of domain name trademark cases" -- that is the population
about which we wish to make meaningful statements and draw
conclusions. Your set of "known domain name trademark cases" *is a
sample* of that entire population. If you make an inference from your
*sample* to the entire population of interest (and you do make such
inferences -- see below), then you are using a sampling technique,
whether you are aware of it or not: a statistically undisciplined
sampling technique, but a sampling technique nonetheless.

As you are well aware (but do not mention in the study), this sample
has obvious systematic bias -- for example, it systematically excludes
all cases where the parties agree not to talk about the results. It
almost certainly has other biases as well.

> By the same token, a census of all the *known* cases cannot be
> mathematically massaged into rigorous knowledge of the *unknown*
> cases.

This is a rather wild generalization. 1) Statistical inference is
rigorous, if one follows the rules -- the conclusions may carry
annoying addenda like confidence levels, but there is no question
about the rigor of statistical inference. 2) One most certainly can
develop knowledge of unknown cases from knowledge of known cases, both
statistically and in general. It borders on nonsense to say one can't
-- the scientific enterprise in its entirety depends rather heavily on
this assumption. In particular, your study would be completely
pointless without it.

[description of methodological problems deleted]

> so one would spend
> millions of dollars, and still have a study open to charges of
> statistical bias.

Implying, of course, that the study is open to such charges...

> We think we did it the best way feasible. Count the facts.
> With more resources and time, we would be happy to perform more
> extensive studies.

My sympathies for the methodological problems (which are, however,
essentially unmentioned in the paper). Whether more sophisticated
techniques could extract more information without a substantial
increase in cost is an open question -- just looking at the paper, my
opinion is that a more sophisticated but still cost-effective
methodology could probably be found. But fundamentally that is not the
issue...

> But that should not let the TM lobby off the
> hook. The facts that we know--121 cases, a sizable
> number--suggest that the domain name trademark interaction is a
> lot more favorable to TM holders, and a lot less threatening to
> TM interests, that INTA and others would claim.

...the real issue is exemplified in the paragraph above. It makes
*inferences* from an obviously biased sample to the total population
of interest. This suggests either 1) a failure to understand that such
inferences are not justified, or 2) a lack of concern about whether
they are justified at all.

In my opinion, an approach with a higher level of intellectual and
academic rigor would eschew the obvious advocacy, and perhaps say
something like this: "While the data would seem to suggest that the
domain name trademark interaction is favorable to TM holders, it
remains only a suggestion. For the reasons described above concerning
our methodological problems, no such conclusion can be rigorously
substantiated with our data."

---

On rereading, the paper is worse than I remembered -- though on my
first read I wasn't explicitly looking for statistical errors.
Contrary to what you said in previous messages in this thread, there
is essentially no mention of the biases inherent in the sample, and
there is no discussion of sampling difficulty.
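To make the point concrete, here is a toy simulation -- entirely my
own construction, every number in it invented for illustration and
none taken from the study. It models a reporting bias (losses settled
quietly, hence never "known") and then computes a textbook 95%
confidence interval from the "known" cases. The interval is rigorous,
but only under the random-sampling assumption that a self-selected
sample of known cases violates:

```python
import math
import random

random.seed(42)

# Hypothetical population of 10,000 disputes. Suppose (purely for
# illustration) TM holders prevail in 50% of ALL disputes, but a
# dispute the TM holder loses is quietly settled -- and hence never
# becomes a "known case" -- 80% of the time, versus 20% for wins.
population = []
for _ in range(10_000):
    tm_wins = random.random() < 0.50
    quiet = random.random() < (0.20 if tm_wins else 0.80)
    population.append((tm_wins, quiet))

true_rate = sum(w for w, _ in population) / len(population)

# The "known cases" sample: everything that wasn't settled quietly.
known = [w for w, q in population if not q]
observed_rate = sum(known) / len(known)

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion -- rigorous, but
    only for an unbiased random sample."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(sum(known), len(known))
print(f"true TM-win rate:         {true_rate:.2f}")
print(f"rate among 'known' cases: {observed_rate:.2f}")
print(f"95% CI from known cases:  [{lo:.2f}, {hi:.2f}]")
```

In this toy model the interval around the observed rate is tight, and
the true rate lies nowhere near it -- precision without validity. That
is exactly what drawing conclusions from an unexamined sample of known
cases risks.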
We find statements like:

    Real trademark infringement using domain names is a rare and
    not very significant problem: 0.0128%, or a total of no more
    than 257 cases in the generic TLDs administered by NSI. Nearly
    all infringement activity has been quickly stopped by lawsuits.

This extrapolation from the sample to the total population is
unjustified. Moreover, it is meaningless, because it mixes data from
two differently defined samples -- the sample of known domain name
disputes developed in the paper, and the 3903 disputes NSI reportedly
received complaints about. There are obvious techniques by which these
samples could be related, but I suspect that the only datum really
available from NSI was "3903", and thus the samples cannot really be
compared.

Another serious gap is the absence of any attempt to deal with the
unreported cases. Carl Oppedahl has made some wild estimates -- you
cite him, but the fact that you don't reference his estimates (which,
as I recall, indicate that the number of unreported cases was fairly
large, and thus do not support your conclusions) makes one wonder
why...

There are other examples, but this is too long already. The
overwhelming impression is that statistical rigor was simply not a
consideration in the production of this paper. Noting also that it was
announced with a "press release" to various mailing lists, and that
the last part is a collection of policy recommendations (i.e.,
advocacy), it is hard to consider it serious scholarship -- the data
and methodology do not support the conclusions, the paper is obviously
biased, and it was obviously produced in support of that pre-existing
bias.

--
Kent Crispin, PAB Chair               "Do good, and you'll be
[EMAIL PROTECTED]                      lonesome."
                                       -- Mark Twain
