Re: [Jprogramming] Efficient counting of occurrences?

Joseph Novakovich Sun, 13 Sep 2020 15:32:10 -0700

Perhaps keying before grading fits the bill?

   'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
40000000
   ;:^:_1 ":&.> 5 {. \:~ (#;{.)/.~ ];._2 (1!:1) < 'names.list'
10326 ffg
10315 eaf
10313 bcd
10290 did
10268 ffa
   timespacex '5 {. \:~ (#;{.)/.~ ];._2 (1!:1) < ''names.list'''
0.28896 3.69102e8


Apologies if I've misunderstood the task!

On 9/13/20, Raul Miller <rauldmil...@gmail.com> wrote:
> This took slightly over two seconds for me (and 168megs memory):
>
>   wrds=: /:~[;._2 fread'names.list'
>   b=: -.2-:/\wrds
>   L=: -_2-/\0,I.b,1
>   o=: \: L
>   words=: o{~.wrds
>   counts=: o{L
>   echo words ,:&(<"_1)&(5{.])counts
>
> (And note that I didn't actually need b -- the bit vector which marks
> value changes -- to be given a name, since I only used it once.)
>
> Generally speaking, with J and large data sets, you want to be
> avoiding loops. Also, sorting tends to perform better than hashes on
> large data sets. Hashes start making sense in different contexts.
>
> I hope this helps,
>
> --
> Raul
>
> On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren <jfond...@minimaltype.com>
> wrote:
>>
>> Suppose we have a file with 10 million lines, with a lot of
>> repeats (realistic example: a huge log with a lot of emails or
>> IPs to consider).
>>
>> This sentence can create such a file:
>>
>>    'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
>>
>> And here are some ways to get the top 5 occurrences in this
>> list, with their rough costs:
>>
>> (5s, 1.3 GB) Bash oneliner:
>>
>>    sort names.list| uniq -c|sort -n|tail -5
>>
>> (2s, 6 MB) Perl, using a hash to count words:
>>
>>    #! /usr/bin/env perl
>>    use strict;
>>    use warnings;
>>
>>    my %words;
>>    open my $ws, '<', 'names.list' or die $!;
>>    while (defined (my $w = <$ws>)) {
>>        chomp $w;
>>        $words{$w}++;
>>    }
>>    printf "%7d %s\n", $words{$_}, $_ for
>>        (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
>>
>> (12s, 11 GB) J, avoiding immutable i. hash tables:
>>
>>    words =: s:<;._2 fread 'names.list'
>>    uniq  =: ~. words
>>    count =: +/"1 uniq ="0 1 words
>>    top =: ((#count) - >:i.5) ({ /:) count
>>    echo (5 s: top { uniq) ,: <"0 top { count
>>
>> (9s, 1.5 GB) J, trying to use immutable i. hash tables:
>>
>>    words =: 0$<''
>>    count =: 0$0
>>    search =: words&i.
>>    3 : 0''
>>      for_w. 'b' fread 'names.list' do.
>>        i =. search w
>>        if. i = #words do.
>>          words =: words , w
>>          search =: words&i.
>>          count =: count , 1
>>        else.
>>          count =: (>:i{count) i } count
>>        end.
>>      end.
>>    )
>>    top =: ((#count) - >:i.5) ({ /:) count
>>    echo (top { words) ,: <"0 top { count
>>
>> This only has to rebuild the hash table 1e3 times despite
>> processing 1e7 lines, so 'search' does very well.
>>
>> On less friendly data (by taking the names.list generator above
>> changing 1000 to 1e7) this solution didn't finish in 22 minutes.
>> Perl's also hit by the rebuilds, going from 2s to 19s.
>>
>> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
>>
>>    coclass 'dictionary'
>>    okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
>>    ok=: ] [ [: assert [: *./ e.&okchar
>>    intern=: [: ('z' , ok)&.> boxxopen
>>    has=: _1 < nc@intern
>>    set=: 4 :'(intern x)=: y'
>>    get=: ".@>@intern
>>
>>    words=: conew 'dictionary'
>>    keys=:0$<''
>>    3 : 0''
>>      for_w. 'b' fread 'names.list' do.
>>        if. has__words >w do.
>>          (>w) set__words 1+get__words (>w)
>>        else.
>>          (>w) set__words 1
>>          keys =: keys , w
>>        end.
>>      end.
>>    )
>>    values=:get__words each keys
>>    echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
>>
>> (1.5s, 450 MB) a jd solution:
>>
>>    require 'data/jd'
>>    words=:'m' fread 'names.list'
>>    jdadminnew'tmp'
>>    jd'createtable t1 word byte ',":1{$words
>>    jd'insert';'t1';'word';words
>>    'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
>> order by count desc'
>>    echo (<"1 ws) ,: ;/cn
>>
>>
>> This is a problem where, in a language with mutable hash
>> tables, I'd use them and this would mostly work out well. In
>> J, I feel like I'm missing some useful primitive or some APL
>> technique. Is that the case?
>>
>> Cheers,
>> Julian
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Efficient counting of occurrences?

Reply via email to