Hi Karin,

Thanks again for this report. I believe I know what the problem is. I ran
your data with rank.pl and got a value of -0.7379, which seems at least
reasonable. I puzzled over this a bit, then went back, ran an older version
of rank.pl, and got the value you reported of -1.1000.
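For the curious: a coefficient outside [-1, 1] is exactly what you'd expect
if the simplified Spearman formula is fed tied ranks. Below is a minimal
Python sketch, purely illustrative: it is not rank.pl's actual code, and the
shared-rank scheme shown for ties is just one plausible assignment, not
necessarily what the old rank.pl did. It shows that the textbook shortcut
formula can escape [-1, 1] under ties, while computing Pearson correlation
on average (fractional) ranks cannot.

```python
# Illustrative only: not rank.pl's code. Shows why the simplified Spearman
# formula can leave [-1, 1] when scores tie, and how averaging the ranks of
# tied items keeps the coefficient in range.

def average_ranks(scores):
    """Rank scores in descending order; ties share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with scores[order[i]].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average of positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(scores_a, scores_b):
    """Tie-aware Spearman: Pearson correlation of average ranks. Always in [-1, 1]."""
    return pearson(average_ranks(scores_a), average_ranks(scores_b))

def naive_rho(ranks_a, ranks_b):
    """Shortcut formula 1 - 6*sum(d^2)/(n(n^2 - 1)): only valid without ties."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Four hypothetical word pairs; in list B, three of them tie on the same score.
a_scores = [10, 9, 8, 7]
b_scores = [1, 1, 1, 9]
print(round(spearman(a_scores, b_scores), 4))   # -0.7746, safely in range

# If the tied items are instead all assigned the same (bottom) rank position
# and plugged into the shortcut formula, the result escapes [-1, 1]:
print(round(naive_rho([1, 2, 3, 4], [4, 4, 4, 1]), 4))  # -1.3
```

The shortcut formula's denominator assumes the ranks are a permutation of
1..n; once ties make that false, its value is no longer bounded by 1 in
absolute value, which is one way a program can print something like -1.1000.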
The difference between the older and newer versions has to do with how ties
are handled, and I think the new version is more correct. So you might want
to upgrade to version 1.25 of Text::NSP, which will give you a version of
rank.pl that should produce more reasonable results in this case. In
general, the differences between the old and new versions only appear when
there are a significant number of ties, as there were in this case...

http://search.cpan.org/~tpederse/Text-NSP/

Please let me know if you have any other questions, and thanks again for
your report!

Good luck,
Ted

On Wed, Feb 6, 2013 at 6:15 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
> Hi Karin,
>
> This is very interesting, and I will certainly look into this further and
> report back! Thank you for the additional information on this; it does
> seem like an interesting case.
>
> More soon!
> Ted
>
>
> On Wed, Feb 6, 2013 at 2:34 AM, Karin Cavallin
> <karin.caval...@ling.gu.se> wrote:
>
>> Hi Ted,
>>
>> Since I compare 5 different corpora (size-wise and occurrence-wise),
>> basically all the sets have different numbers of pairs. I have run
>> rank.pl on more than 100,000 lexical sets. Most of them get no ranking
>> coefficient, since there are no co-occurrences between the sets; some of
>> them do get a coefficient ranging from -1.0000 to 1.0000, as expected.
>> One lexical set gets this -1.1000, the one I sent you.
>>
>> So I don't think it is because the sets are too different, but something
>> that is beyond me. That's why I thought it was important to report it to
>> you.
>>
>> /karin
>>
>> Karin Cavallin
>> PhD Student in Computational Linguistics
>> University of Gothenburg, Sweden
>>
>> ------------------------------
>> *From:* duluth...@gmail.com [duluth...@gmail.com] on behalf of Ted
>> Pedersen [tpede...@d.umn.edu]
>> *Sent:* 6 February 2013 03:35
>> *To:* ngram@yahoogroups.com
>> *Cc:* Karin Cavallin
>> *Subject:* Re: [ngram] Fwd: -1.1000 (sic!)
>> as result from rank.pl [1 Attachment]
>>
>> Hi Karin,
>>
>> I think the problem you are having is due to the fact that you have
>> different numbers of word pairs in each list, and the fact that most of
>> the word pairs are unique to each list. In general, rank.pl expects that
>> the two input files be made up of the same pairs of words (just ranked
>> differently by a different measure of association, for example). When
>> that isn't the case, the program will eliminate any word pairs that
>> aren't in both files and then run. So I think this combination of issues
>> is causing rank.pl to return this very unexpected value.
>>
>> My guess is that it's the fact that the number of input pairs is
>> different in each file, but I will do a little more checking in the next
>> day or two to really see for sure. Here's a link to the rank.pl
>> documentation that describes how this particular case is intended to be
>> handled...
>>
>> http://search.cpan.org/dist/Text-NSP/bin/utils/rank.pl#1.4._Dealing_with_Dissimilar_Lists_of_N-grams
>>
>> More soon,
>> Ted
>>
>>
>> On Tue, Feb 5, 2013 at 10:06 AM, Ted Pedersen <tpede...@d.umn.edu> wrote:
>>
>>> [Attachment(s) from Ted Pedersen included below]
>>>
>>> ---------- Forwarded message ----------
>>> From: Karin Cavallin <karin.caval...@ling.gu.se>
>>> Date: Tue, Feb 5, 2013 at 8:53 AM
>>> Subject: -1.1000 (sic!) as result from rank.pl
>>> To: "tpede...@umn.edu" <tpede...@umn.edu>
>>>
>>> Dear Professor Ted,
>>>
>>> I didn't know whom to report this error to, so I hope you can forward
>>> this to the appropriate recipient.
>>>
>>> I have been using NSP for a while, especially the bigram packages. I'm
>>> working with lexical sets of verbal predicates and nominal objects, and
>>> doing collocational analysis on them. I wanted to compare the ranking
>>> between sets coming from different corpora.
>>> (I know it is quite uninteresting to do ranking on such different data,
>>> but I am trying different things for my thesis.)
>>>
>>> Today I noticed one lexical set to be -1.1000; this should not be
>>> possible! (I have only noticed this one time.)
>>>
>>> karin$ rank.pl 65_anstr.txt 95_anstr.txt
>>> Rank correlation coefficient = -1.1000
>>>
>>> I attached the files from which I get this weird outcome.
>>>
>>> Best regards,
>>> /karin
>>>
>>> Karin Cavallin
>>> PhD Student in Computational Linguistics
>>> University of Gothenburg, Sweden
>>>
>>> sky<>ansträngning<>505 25.1952 2 5 15
>>> fördubbla<>ansträngning<>1582 10.8890 1 5 15
>>> koncentrera<>ansträngning<>1912 9.1951 1 11 15
>>> krävas<>ansträngning<>2172 8.2948 1 17 15
>>> underlätta<>ansträngning<>2172 8.2948 1 17 15
>>> märka<>ansträngning<>2471 7.4301 1 26 15
>>> göra<>ansträngning<>2915 6.3704 3 1323 15
>>> fortsätta<>ansträngning<>3097 6.0043 1 53 15
>>> kosta<>ansträngning<>3723 4.8170 1 97 15
>>> och<>ansträngning<>4162 4.0424 1 145 15
>>> sätta<>ansträngning<>4482 3.4540 1 198 15
>>> lägga<>ansträngning<>4745 3.0005 1 253 15
>>>
>>> intensifiera<>ansträngning<>3951 40.5247 3 22 33
>>> göra<>ansträngning<>4665 35.6553 12 20089 33
>>> fortsätta<>ansträngning<>8254 21.8238 3 468 33
>>> kräva<>ansträngning<>10206 17.4829 3 973 33
>>> trotsa<>ansträngning<>17176 9.9897 1 39 33
>>> välkomna<>ansträngning<>18254 9.3712 1 53 33
>>> underlätta<>ansträngning<>20704 8.1388 1 98 33
>>> döma<>ansträngning<>22762 7.1873 1 158 33
>>> skada<>ansträngning<>23084 7.0537 1 169 33
>>> rikta<>ansträngning<>23084 7.0537 1 169 33
>>> ha<>ansträngning<>23176 7.0134 1 89009 33
>>> stödja<>ansträngning<>25349 6.1642 1 265 33
>>> krävas<>ansträngning<>25718 6.0348 1 283 33
>>> vara<>ansträngning<>29926 4.5609 1 603 33
>>> leda<>ansträngning<>30789 4.2612 1 705 33
>>> öka<>ansträngning<>33145 3.4625 1 1076 33
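For anyone finding this thread in the archives: the rank.pl documentation
linked earlier in the thread describes how dissimilar lists are handled,
namely by discarding bigrams that do not appear in both input files before
the correlation is computed. Below is a minimal Python sketch of just that
filtering step. It is an illustration, not rank.pl's code, and the field
layout is inferred from the attached files, where the first number after
the bigram appears to be its rank.

```python
def parse_ranks(lines):
    """Map (word1, word2) -> rank, assuming NSP-style lines like
    'sky<>ansträngning<>505 25.1952 2 5 15' (first number = rank)."""
    ranks = {}
    for line in lines:
        word1, _, rest = line.partition('<>')
        word2, _, rest = rest.partition('<>')
        ranks[(word1, word2)] = int(rest.split()[0])
    return ranks

def shared_pairs(lines_a, lines_b):
    """Keep only bigrams present in both files, paired with both ranks."""
    a, b = parse_ranks(lines_a), parse_ranks(lines_b)
    return sorted((pair, a[pair], b[pair]) for pair in set(a) & set(b))

# Two lines from each of the attached files:
file_a = ["göra<>ansträngning<>2915 6.3704 3 1323 15",
          "sky<>ansträngning<>505 25.1952 2 5 15"]
file_b = ["göra<>ansträngning<>4665 35.6553 12 20089 33",
          "ha<>ansträngning<>23176 7.0134 1 89009 33"]
print(shared_pairs(file_a, file_b))
# [(('göra', 'ansträngning'), 2915, 4665)] — only the bigram in both survives
```

Since the correlation is then computed only on the surviving pairs, two
files with many unique bigrams can leave a much smaller effective sample
than either file's length suggests, which is why Ted flags the dissimilar
lists as part of the picture.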