Hi Karin,

Thanks again for this report. I believe I know what the problem is. I ran
your data with rank.pl and got a value of -0.7379, which seems at least
reasonable. I puzzled over this a bit, then went back, ran an older version
of rank.pl, and got the value you reported of -1.1000.
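For the curious: a coefficient outside [-1, 1] is exactly what you'd expect
if the simplified Spearman formula is fed tied ranks. Below is a minimal
Python sketch, purely illustrative: it is not rank.pl's actual code, and the
shared-rank scheme shown for ties is just one plausible assignment, not
necessarily what the old rank.pl did. It shows that the textbook shortcut
formula can escape [-1, 1] under ties, while computing Pearson correlation
on average (fractional) ranks cannot.

```python
# Illustrative only: not rank.pl's code. Shows why the simplified Spearman
# formula can leave [-1, 1] when scores tie, and how averaging the ranks of
# tied items keeps the coefficient in range.

def average_ranks(scores):
    """Rank scores in descending order; ties share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with scores[order[i]].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average of positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(scores_a, scores_b):
    """Tie-aware Spearman: Pearson correlation of average ranks. Always in [-1, 1]."""
    return pearson(average_ranks(scores_a), average_ranks(scores_b))

def naive_rho(ranks_a, ranks_b):
    """Shortcut formula 1 - 6*sum(d^2)/(n(n^2 - 1)): only valid without ties."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Four hypothetical word pairs; in list B, three of them tie on the same score.
a_scores = [10, 9, 8, 7]
b_scores = [1, 1, 1, 9]
print(round(spearman(a_scores, b_scores), 4))   # -0.7746, safely in range

# If the tied items are instead all assigned the same (bottom) rank position
# and plugged into the shortcut formula, the result escapes [-1, 1]:
print(round(naive_rho([1, 2, 3, 4], [4, 4, 4, 1]), 4))  # -1.3
```

The shortcut formula's denominator assumes the ranks are a permutation of
1..n; once ties make that false, its value is no longer bounded by 1 in
absolute value, which is one way a program can print something like -1.1000.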
The difference between the older and newer versions has to do with how ties
are handled, and I think the new version is more correct. So you might want
to upgrade to version 1.25 of Text::NSP, which will give you a version of
rank.pl that should produce more reasonable results in this case. In
general, the differences between the old and new versions only appear when
there are a significant number of ties, as there were in this case...

http://search.cpan.org/~tpederse/Text-NSP/

Please let me know if you have any other questions, and thanks again for
your report!

Good luck,
Ted

On Wed, Feb 6, 2013 at 6:15 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
> Hi Karin,
>
> This is very interesting, and I will certainly look into this further and
> report back! Thank you for the additional information on this; it does
> seem like an interesting case.
>
> More soon!
> Ted
>
>
> On Wed, Feb 6, 2013 at 2:34 AM, Karin Cavallin
> <karin.caval...@ling.gu.se> wrote:
>
>> Hi Ted,
>>
>> Since I compare 5 different corpora (size-wise and occurrence-wise),
>> basically all the sets have different numbers of pairs. I have run
>> rank.pl on more than 100,000 lexical sets. Most of them get no ranking
>> coefficient, since there are no co-occurrences between the sets; some of
>> them do get a coefficient ranging from -1.0000 to 1.0000, as expected.
>> One lexical set gets this -1.1000, the one I sent you.
>>
>> So I don't think it is because the sets are too different, but something
>> that is beyond me. That's why I thought it was important to report it to
>> you.
>>
>> /karin
>>
>> Karin Cavallin
>> PhD Student in Computational Linguistics
>> University of Gothenburg, Sweden
>>
>> ------------------------------
>> *From:* duluth...@gmail.com [duluth...@gmail.com] on behalf of Ted
>> Pedersen [tpede...@d.umn.edu]
>> *Sent:* 6 February 2013 03:35
>> *To:* ngram@yahoogroups.com
>> *Cc:* Karin Cavallin
>> *Subject:* Re: [ngram] Fwd: -1.1000 (sic!)
>> as result from rank.pl [1 Attachment]
>>
>> Hi Karin,
>>
>> I think the problem you are having is due to the fact that you have
>> different numbers of word pairs in each list, and the fact that most of
>> the word pairs are unique to each list. In general, rank.pl expects that
>> the two input files be made up of the same pairs of words (just ranked
>> differently by a different measure of association, for example). When
>> that isn't the case, the program will eliminate any word pairs that
>> aren't in both files and then run. So I think this combination of issues
>> is causing rank.pl to return this very unexpected value.
>>
>> My guess is that it's the fact that the number of input pairs is
>> different in each file, but I will do a little more checking in the next
>> day or two to really see for sure. Here's a link to the rank.pl
>> documentation that describes how this particular case is intended to be
>> handled...
>>
>> http://search.cpan.org/dist/Text-NSP/bin/utils/rank.pl#1.4._Dealing_with_Dissimilar_Lists_of_N-grams
>>
>> More soon,
>> Ted
>>
>>
>> On Tue, Feb 5, 2013 at 10:06 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>
>>> [Attachment(s) from Ted Pedersen included below]
>>>
>>> ---------- Forwarded message ----------
>>> From: Karin Cavallin <karin.caval...@ling.gu.se>
>>> Date: Tue, Feb 5, 2013 at 8:53 AM
>>> Subject: -1.1000 (sic!) as result from rank.pl
>>> To: "tpede...@umn.edu" <tpede...@umn.edu>
>>>
>>> Dear Professor Ted,
>>>
>>> I didn't know whom to report this error to, so I hope you can forward
>>> this to the appropriate recipient.
>>>
>>> I have been using NSP for a while, especially the bigram packages. I'm
>>> working with lexical sets of verbal predicates and nominal objects, and
>>> doing collocational analysis on them. I wanted to compare the ranking
>>> between sets coming from different corpora.
>>> (I know it is quite uninteresting to do ranking on such different data,
>>> but I am trying different things for my thesis.)
>>>
>>> Today I noticed one lexical set to be -1.1000; this should not be
>>> possible! (I have only noticed this one time.)
>>>
>>> karin$ rank.pl 65_anstr.txt 95_anstr.txt
>>> Rank correlation coefficient = -1.1000
>>>
>>> I attached the files from which I get this weird outcome.
>>>
>>> Best regards,
>>> /karin
>>>
>>> Karin Cavallin
>>> PhD Student in Computational Linguistics
>>> University of Gothenburg, Sweden
>>>
>>> sky<>ansträngning<>505 25.1952 2 5 15
>>> fördubbla<>ansträngning<>1582 10.8890 1 5 15
>>> koncentrera<>ansträngning<>1912 9.1951 1 11 15
>>> krävas<>ansträngning<>2172 8.2948 1 17 15
>>> underlätta<>ansträngning<>2172 8.2948 1 17 15
>>> märka<>ansträngning<>2471 7.4301 1 26 15
>>> göra<>ansträngning<>2915 6.3704 3 1323 15
>>> fortsätta<>ansträngning<>3097 6.0043 1 53 15
>>> kosta<>ansträngning<>3723 4.8170 1 97 15
>>> och<>ansträngning<>4162 4.0424 1 145 15
>>> sätta<>ansträngning<>4482 3.4540 1 198 15
>>> lägga<>ansträngning<>4745 3.0005 1 253 15
>>>
>>> intensifiera<>ansträngning<>3951 40.5247 3 22 33
>>> göra<>ansträngning<>4665 35.6553 12 20089 33
>>> fortsätta<>ansträngning<>8254 21.8238 3 468 33
>>> kräva<>ansträngning<>10206 17.4829 3 973 33
>>> trotsa<>ansträngning<>17176 9.9897 1 39 33
>>> välkomna<>ansträngning<>18254 9.3712 1 53 33
>>> underlätta<>ansträngning<>20704 8.1388 1 98 33
>>> döma<>ansträngning<>22762 7.1873 1 158 33
>>> skada<>ansträngning<>23084 7.0537 1 169 33
>>> rikta<>ansträngning<>23084 7.0537 1 169 33
>>> ha<>ansträngning<>23176 7.0134 1 89009 33
>>> stödja<>ansträngning<>25349 6.1642 1 265 33
>>> krävas<>ansträngning<>25718 6.0348 1 283 33
>>> vara<>ansträngning<>29926 4.5609 1 603 33
>>> leda<>ansträngning<>30789 4.2612 1 705 33
>>> öka<>ansträngning<>33145 3.4625 1 1076 33
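For anyone finding this thread in the archives: the rank.pl documentation
linked earlier in the thread describes how dissimilar lists are handled,
namely by discarding bigrams that do not appear in both input files before
the correlation is computed. Below is a minimal Python sketch of just that
filtering step. It is an illustration, not rank.pl's code, and the field
layout is inferred from the attached files, where the first number after
the bigram appears to be its rank.

```python
def parse_ranks(lines):
    """Map (word1, word2) -> rank, assuming NSP-style lines like
    'sky<>ansträngning<>505 25.1952 2 5 15' (first number = rank)."""
    ranks = {}
    for line in lines:
        word1, _, rest = line.partition('<>')
        word2, _, rest = rest.partition('<>')
        ranks[(word1, word2)] = int(rest.split()[0])
    return ranks

def shared_pairs(lines_a, lines_b):
    """Keep only bigrams present in both files, paired with both ranks."""
    a, b = parse_ranks(lines_a), parse_ranks(lines_b)
    return sorted((pair, a[pair], b[pair]) for pair in set(a) & set(b))

# Two lines from each of the attached files:
file_a = ["göra<>ansträngning<>2915 6.3704 3 1323 15",
          "sky<>ansträngning<>505 25.1952 2 5 15"]
file_b = ["göra<>ansträngning<>4665 35.6553 12 20089 33",
          "ha<>ansträngning<>23176 7.0134 1 89009 33"]
print(shared_pairs(file_a, file_b))
# [(('göra', 'ansträngning'), 2915, 4665)] — only the bigram in both survives
```

Since the correlation is then computed only on the surviving pairs, two
files with many unique bigrams can leave a much smaller effective sample
than either file's length suggests, which is why Ted flags the dissimilar
lists as part of the picture.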