BTW, if you aren't sure whether you have the most current version of rank.pl installed, you can check this way:
marimba(300): rank.pl --version
rank.pl - version 0.03
Copyright (C) 2000-2012, Ted Pedersen & Satanjeev Banerjee & Bridget T McInnes

The older version (that gave -1.1000) was version 0.01.

On Sat, Feb 9, 2013 at 4:24 PM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>
> Hi Karin,
>
> Thanks again for this report. I believe I know what the problem is - I ran
> your data with rank.pl and got a value of -0.7379, which seems at least
> reasonable. I puzzled over this a bit, but then went back and used an older
> version of rank.pl and got the value you reported of -1.1000.
>
> The difference between the older and newer versions has to do with how ties
> are handled, and I think the new version is more correct. So, I think you
> might want to upgrade to version 1.25 of Text::NSP, which will give you a
> version of rank.pl that will hopefully give you more reasonable results in
> this case. In general the differences between the old and the new only
> appear when there are a significant number of ties, as there were in this
> case...
>
> http://search.cpan.org/~tpederse/Text-NSP/
>
> Please let me know if you have any other questions, and thanks again for
> your report!
>
> Good luck,
> Ted
>
> On Wed, Feb 6, 2013 at 6:15 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>
>> Hi Karin,
>>
>> This is very interesting, and I will certainly look into this further and
>> report back! Thank you for the additional information on this; it does
>> seem like an interesting case.
>>
>> More soon!
>> Ted
>>
>> On Wed, Feb 6, 2013 at 2:34 AM, Karin Cavallin <karin.caval...@ling.gu.se> wrote:
>>>
>>> Hi Ted,
>>>
>>> Since I compare 5 different corpora (size-wise and occurrence-wise),
>>> basically all the sets have different numbers of pairs. I have run
>>> rank.pl on more than 100,000 lexical sets; most of them get no ranking
>>> coefficient since there are no co-occurrences between the sets, and some
>>> of them do get a coefficient ranging from -1.0000 to 1.0000, as
>>> expected.
>>> One lexical set gets this -1.1000: the one I sent you.
>>>
>>> So, I don't think it is because the sets are too different, but
>>> something that is beyond me. That's why I thought it was important to
>>> report it to you.
>>>
>>> /karin
>>>
>>> Karin Cavallin
>>> PhD Student in Computational Linguistics
>>> University of Gothenburg, Sweden
>>>
>>> ------------------------------
>>> *From:* duluth...@gmail.com [duluth...@gmail.com] on behalf of Ted Pedersen [tpede...@d.umn.edu]
>>> *Sent:* 6 February 2013, 03:35
>>> *To:* ngram@yahoogroups.com
>>> *Cc:* Karin Cavallin
>>> *Subject:* Re: [ngram] Fwd: -1.1000 (sic!) as result from rank.pl [1 Attachment]
>>>
>>> Hi Karin,
>>>
>>> I think the problem you are having is due to the fact that you have
>>> different numbers of word pairs in each list, and the fact that most of
>>> the word pairs are unique to each list. In general rank.pl expects the
>>> two input files to be made up of the same pairs of words (just ranked
>>> differently, for example by a different measure of association). When
>>> that isn't the case, the program will eliminate any word pairs that
>>> aren't in both files and then run. So, I think this combination of
>>> issues is causing rank.pl to return this very unexpected value.
>>>
>>> My guess is that it's the fact that the number of input pairs is
>>> different in each file, but I will do a little more checking in the next
>>> day or two to really see for sure. Here's a link to the rank.pl
>>> documentation that describes how this particular case is intended to be
>>> handled...
>>>
>>> http://search.cpan.org/dist/Text-NSP/bin/utils/rank.pl#1.4._Dealing_with_Dissimilar_Lists_of_N-grams
>>>
>>> More soon,
>>> Ted
>>>
>>> On Tue, Feb 5, 2013 at 10:06 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>>>
>>>> [Attachment(s) from Ted Pedersen included below]
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Karin Cavallin <karin.caval...@ling.gu.se>
>>>> Date: Tue, Feb 5, 2013 at 8:53 AM
>>>> Subject: -1.1000 (sic!) as result from rank.pl
>>>> To: "tpede...@umn.edu" <tpede...@umn.edu>
>>>>
>>>> Dear Professor Ted,
>>>>
>>>> I didn't know whom to report this error to, so I hope you can forward
>>>> this to the appropriate recipient.
>>>>
>>>> I have been using the NSP for a while, especially the bigram packages.
>>>> I'm working with lexical sets of verbal predicates and nominal objects,
>>>> and doing collocational analysis on them. I wanted to compare the
>>>> rankings between sets coming from different corpora. (I know it is
>>>> quite uninteresting to do ranking on such different data, but I am
>>>> trying different things for my thesis.)
>>>>
>>>> Today I noticed the coefficient for one lexical set to be -1.1000,
>>>> which should not be possible! (I have only noticed this one time.)
>>>>
>>>> karin$ rank.pl 65_anstr.txt 95_anstr.txt
>>>> Rank correlation coefficient = -1.1000
>>>>
>>>> I attached the files from which I get this weird outcome.
>>>>
>>>> Best regards,
>>>> /karin
>>>>
>>>> Karin Cavallin
>>>> PhD Student in Computational Linguistics
>>>> University of Gothenburg, Sweden
>>>>
>>>> sky<>ansträngning<>505 25.1952 2 5 15
>>>> fördubbla<>ansträngning<>1582 10.8890 1 5 15
>>>> koncentrera<>ansträngning<>1912 9.1951 1 11 15
>>>> krävas<>ansträngning<>2172 8.2948 1 17 15
>>>> underlätta<>ansträngning<>2172 8.2948 1 17 15
>>>> märka<>ansträngning<>2471 7.4301 1 26 15
>>>> göra<>ansträngning<>2915 6.3704 3 1323 15
>>>> fortsätta<>ansträngning<>3097 6.0043 1 53 15
>>>> kosta<>ansträngning<>3723 4.8170 1 97 15
>>>> och<>ansträngning<>4162 4.0424 1 145 15
>>>> sätta<>ansträngning<>4482 3.4540 1 198 15
>>>> lägga<>ansträngning<>4745 3.0005 1 253 15
>>>>
>>>> intensifiera<>ansträngning<>3951 40.5247 3 22 33
>>>> göra<>ansträngning<>4665 35.6553 12 20089 33
>>>> fortsätta<>ansträngning<>8254 21.8238 3 468 33
>>>> kräva<>ansträngning<>10206 17.4829 3 973 33
>>>> trotsa<>ansträngning<>17176 9.9897 1 39 33
>>>> välkomna<>ansträngning<>18254 9.3712 1 53 33
>>>> underlätta<>ansträngning<>20704 8.1388 1 98 33
>>>> döma<>ansträngning<>22762 7.1873 1 158 33
>>>> skada<>ansträngning<>23084 7.0537 1 169 33
>>>> rikta<>ansträngning<>23084 7.0537 1 169 33
>>>> ha<>ansträngning<>23176 7.0134 1 89009 33
>>>> stödja<>ansträngning<>25349 6.1642 1 265 33
>>>> krävas<>ansträngning<>25718 6.0348 1 283 33
>>>> vara<>ansträngning<>29926 4.5609 1 603 33
>>>> leda<>ansträngning<>30789 4.2612 1 705 33
>>>> öka<>ansträngning<>33145 3.4625 1 1076 33
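[Editor's note: the steps Ted describes above - keep only the word pairs present in both input files, re-rank each list by its association score with tied scores sharing an averaged rank, then compute the rank correlation coefficient - can be sketched as below. This is an illustration of the concept only, not rank.pl's actual code; the function names and the tiny example data (a few pairs lifted from the attached files) are made up for the demonstration, and rank.pl's internals may differ in detail.]

```python
def averaged_ranks(scores):
    """Rank scores in descending order; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the extent of the group tied with scores[order[i]].
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_correlation(list_a, list_b):
    """Spearman's coefficient over the pairs common to both inputs.

    Each input is a dict mapping a word pair to its association score.
    Pairs not present in both inputs are dropped, as the rank.pl
    documentation describes for dissimilar lists.
    """
    common = sorted(set(list_a) & set(list_b))
    ra = averaged_ranks([list_a[p] for p in common])
    rb = averaged_ranks([list_b[p] for p in common])
    n = len(common)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Scores taken from the two attached files; only the pairs occurring in
# both files survive the intersection step.
a = {"göra<>ansträngning": 6.3704, "fortsätta<>ansträngning": 6.0043,
     "krävas<>ansträngning": 8.2948, "underlätta<>ansträngning": 8.2948}
b = {"göra<>ansträngning": 35.6553, "fortsätta<>ansträngning": 21.8238,
     "krävas<>ansträngning": 6.0348, "underlätta<>ansträngning": 8.1388}
print(round(rank_correlation(a, b), 4))  # prints -0.65
```

Note how the tie in list `a` (krävas and underlätta both score 8.2948) gets the averaged rank 1.5 for both pairs; with averaged ranks and the classic formula the result stays inside [-1, 1], which is the behavior the newer rank.pl is said to restore.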