BTW, if you aren't sure whether you have the most current version of rank.pl installed, you can check this way:
marimba(300): rank.pl --version
rank.pl - version 0.03
Copyright (C) 2000-2012, Ted Pedersen & Satanjeev Banerjee & Bridget T McInnes

The older version (that gave -1.1000) was version 0.01.

On Sat, Feb 9, 2013 at 4:24 PM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>
> Hi Karin,
>
> Thanks again for this report. I believe I know what the problem is - I ran
> your data with rank.pl and got a value of -0.7379, which seems at least
> reasonable. I puzzled over this a bit, but then went back and used an older
> version of rank.pl and got the value you reported of -1.1000.
>
> The difference between the older and newer versions has to do with how ties
> are handled, and I think the new version is more correct. So, I think you
> might want to upgrade to version 1.25 of Text::NSP, which will give you a
> version of rank.pl that will hopefully give you more reasonable results in
> this case. In general the differences between the old and the new only
> appear when there are a significant number of ties, as there were in this
> case...
>
> http://search.cpan.org/~tpederse/Text-NSP/
>
> Please let me know if you have any other questions, and thanks again for
> your report!
>
> Good luck,
> Ted
>
> On Wed, Feb 6, 2013 at 6:15 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>
>> Hi Karin,
>>
>> This is very interesting, and I will certainly look into this further and
>> report back! Thank you for the additional information on this; it does
>> seem like an interesting case.
>>
>> More soon!
>> Ted
>>
>> On Wed, Feb 6, 2013 at 2:34 AM, Karin Cavallin <karin.caval...@ling.gu.se> wrote:
>>>
>>> Hi Ted,
>>>
>>> Since I compare 5 different corpora (size-wise and occurrence-wise),
>>> basically all the sets have different numbers of pairs. I have run
>>> rank.pl on more than 100,000 lexical sets; most of them get no ranking
>>> coefficient since there are no co-occurrences between the sets, and some
>>> of them do get a coefficient ranging from -1.0000 to 1.0000, as
>>> expected.
>>> One lexical set gets this -1.1000: the one I sent you.
>>>
>>> So, I don't think it is because the sets are too different, but
>>> something that is beyond me. That's why I thought it was important to
>>> report it to you.
>>>
>>> /karin
>>>
>>> Karin Cavallin
>>> PhD Student in Computational Linguistics
>>> University of Gothenburg, Sweden
>>>
>>> ------------------------------
>>> *From:* duluth...@gmail.com [duluth...@gmail.com] on behalf of Ted Pedersen [tpede...@d.umn.edu]
>>> *Sent:* 6 February 2013, 03:35
>>> *To:* ngram@yahoogroups.com
>>> *Cc:* Karin Cavallin
>>> *Subject:* Re: [ngram] Fwd: -1.1000 (sic!) as result from rank.pl [1 Attachment]
>>>
>>> Hi Karin,
>>>
>>> I think the problem you are having is due to the fact that you have
>>> different numbers of word pairs in each list, and the fact that most of
>>> the word pairs are unique to each list. In general rank.pl expects the
>>> two input files to be made up of the same pairs of words (just ranked
>>> differently, for example by a different measure of association). When
>>> that isn't the case, the program will eliminate any word pairs that
>>> aren't in both files and then run. So, I think this combination of
>>> issues is causing rank.pl to return this very unexpected value.
>>>
>>> My guess is that it's the fact that the number of input pairs is
>>> different in each file, but I will do a little more checking in the next
>>> day or two to really see for sure. Here's a link to the rank.pl
>>> documentation that describes how this particular case is intended to be
>>> handled...
>>>
>>> http://search.cpan.org/dist/Text-NSP/bin/utils/rank.pl#1.4._Dealing_with_Dissimilar_Lists_of_N-grams
>>>
>>> More soon,
>>> Ted
>>>
>>> On Tue, Feb 5, 2013 at 10:06 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>>>
>>>> [Attachment(s) from Ted Pedersen included below]
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Karin Cavallin <karin.caval...@ling.gu.se>
>>>> Date: Tue, Feb 5, 2013 at 8:53 AM
>>>> Subject: -1.1000 (sic!) as result from rank.pl
>>>> To: "tpede...@umn.edu" <tpede...@umn.edu>
>>>>
>>>> Dear Professor Ted,
>>>>
>>>> I didn't know whom to report this error to, so I hope you can forward
>>>> this to the appropriate recipient.
>>>>
>>>> I have been using the NSP for a while, especially the bigram packages.
>>>> I'm working with lexical sets of verbal predicates and nominal objects,
>>>> and doing collocational analysis on them. I wanted to compare the
>>>> rankings between sets coming from different corpora. (I know it is
>>>> quite uninteresting to do ranking on such different data, but I am
>>>> trying different things for my thesis.)
>>>>
>>>> Today I noticed the coefficient for one lexical set to be -1.1000,
>>>> which should not be possible! (I have only noticed this one time.)
>>>>
>>>> karin$ rank.pl 65_anstr.txt 95_anstr.txt
>>>> Rank correlation coefficient = -1.1000
>>>>
>>>> I attached the files from which I get this weird outcome.
>>>>
>>>> Best regards,
>>>> /karin
>>>>
>>>> Karin Cavallin
>>>> PhD Student in Computational Linguistics
>>>> University of Gothenburg, Sweden
>>>>
>>>> sky<>ansträngning<>505 25.1952 2 5 15
>>>> fördubbla<>ansträngning<>1582 10.8890 1 5 15
>>>> koncentrera<>ansträngning<>1912 9.1951 1 11 15
>>>> krävas<>ansträngning<>2172 8.2948 1 17 15
>>>> underlätta<>ansträngning<>2172 8.2948 1 17 15
>>>> märka<>ansträngning<>2471 7.4301 1 26 15
>>>> göra<>ansträngning<>2915 6.3704 3 1323 15
>>>> fortsätta<>ansträngning<>3097 6.0043 1 53 15
>>>> kosta<>ansträngning<>3723 4.8170 1 97 15
>>>> och<>ansträngning<>4162 4.0424 1 145 15
>>>> sätta<>ansträngning<>4482 3.4540 1 198 15
>>>> lägga<>ansträngning<>4745 3.0005 1 253 15
>>>>
>>>> intensifiera<>ansträngning<>3951 40.5247 3 22 33
>>>> göra<>ansträngning<>4665 35.6553 12 20089 33
>>>> fortsätta<>ansträngning<>8254 21.8238 3 468 33
>>>> kräva<>ansträngning<>10206 17.4829 3 973 33
>>>> trotsa<>ansträngning<>17176 9.9897 1 39 33
>>>> välkomna<>ansträngning<>18254 9.3712 1 53 33
>>>> underlätta<>ansträngning<>20704 8.1388 1 98 33
>>>> döma<>ansträngning<>22762 7.1873 1 158 33
>>>> skada<>ansträngning<>23084 7.0537 1 169 33
>>>> rikta<>ansträngning<>23084 7.0537 1 169 33
>>>> ha<>ansträngning<>23176 7.0134 1 89009 33
>>>> stödja<>ansträngning<>25349 6.1642 1 265 33
>>>> krävas<>ansträngning<>25718 6.0348 1 283 33
>>>> vara<>ansträngning<>29926 4.5609 1 603 33
>>>> leda<>ansträngning<>30789 4.2612 1 705 33
>>>> öka<>ansträngning<>33145 3.4625 1 1076 33
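[Editor's note: the steps Ted describes above - keep only the word pairs present in both input files, re-rank each list by its association score with tied scores sharing an averaged rank, then compute the rank correlation coefficient - can be sketched as below. This is an illustration of the concept only, not rank.pl's actual code; the function names and the tiny example data (a few pairs lifted from the attached files) are made up for the demonstration, and rank.pl's internals may differ in detail.]

```python
def averaged_ranks(scores):
    """Rank scores in descending order; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the extent of the group tied with scores[order[i]].
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_correlation(list_a, list_b):
    """Spearman's coefficient over the pairs common to both inputs.

    Each input is a dict mapping a word pair to its association score.
    Pairs not present in both inputs are dropped, as the rank.pl
    documentation describes for dissimilar lists.
    """
    common = sorted(set(list_a) & set(list_b))
    ra = averaged_ranks([list_a[p] for p in common])
    rb = averaged_ranks([list_b[p] for p in common])
    n = len(common)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Scores taken from the two attached files; only the pairs occurring in
# both files survive the intersection step.
a = {"göra<>ansträngning": 6.3704, "fortsätta<>ansträngning": 6.0043,
     "krävas<>ansträngning": 8.2948, "underlätta<>ansträngning": 8.2948}
b = {"göra<>ansträngning": 35.6553, "fortsätta<>ansträngning": 21.8238,
     "krävas<>ansträngning": 6.0348, "underlätta<>ansträngning": 8.1388}
print(round(rank_correlation(a, b), 4))  # prints -0.65
```

Note how the tie in list `a` (krävas and underlätta both score 8.2948) gets the averaged rank 1.5 for both pairs; with averaged ranks and the classic formula the result stays inside [-1, 1], which is the behavior the newer rank.pl is said to restore.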