Re: [Jprogramming] Efficient counting of occurrences?

Raul Miller Sun, 13 Sep 2020 14:44:18 -0700

This took slightly over two seconds for me (and 168megs memory):

  wrds=: /:~[;._2 fread'names.list'
  b=: -.2-:/\wrds
  L=: -_2-/\0,I.b,1
  o=: \: L
  words=: o{~.wrds
  counts=: o{L
  echo words ,:&(<"_1)&(5{.])counts


(And note that I didn't actually need b -- the bit vector which marks
value changes -- to be given a name, since I only used it once.)

Generally speaking, with J and large data sets, you want to be
avoiding loops. Also, sorting tends to perform better than hashes on
large data sets. Hashes start making sense in different contexts.

I hope this helps,

-- 
Raul

On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren <jfond...@minimaltype.com> wrote:
>
> Suppose we have a file with 10 million lines, with a lot of
> repeats (realistic example: a huge log with a lot of emails or
> IPs to consider).
>
> This sentence can create such a file:
>
>    'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
>
> And here are some ways to get the top 5 occurrences in this
> list, with their rough costs:
>
> (5s, 1.3 GB) Bash oneliner:
>
>    sort names.list| uniq -c|sort -n|tail -5
>
> (2s, 6 MB) Perl, using a hash to count words:
>
>    #! /usr/bin/env perl
>    use strict;
>    use warnings;
>
>    my %words;
>    open my $ws, '<', 'names.list' or die $!;
>    while (defined (my $w = <$ws>)) {
>        chomp $w;
>        $words{$w}++;
>    }
>    printf "%7d %s\n", $words{$_}, $_ for
>        (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
>
> (12s, 11 GB) J, avoiding immutable i. hash tables:
>
>    words =: s:<;._2 fread 'names.list'
>    uniq  =: ~. words
>    count =: +/"1 uniq ="0 1 words
>    top =: ((#count) - >:i.5) ({ /:) count
>    echo (5 s: top { uniq) ,: <"0 top { count
>
> (9s, 1.5 GB) J, trying to use immutable i. hash tables:
>
>    words =: 0$<''
>    count =: 0$0
>    search =: words&i.
>    3 : 0''
>      for_w. 'b' fread 'names.list' do.
>        i =. search w
>        if. i = #words do.
>          words =: words , w
>          search =: words&i.
>          count =: count , 1
>        else.
>          count =: (>:i{count) i } count
>        end.
>      end.
>    )
>    top =: ((#count) - >:i.5) ({ /:) count
>    echo (top { words) ,: <"0 top { count
>
> This only has to rebuild the hash table 1e3 times despite
> processing 1e7 lines, so 'search' does very well.
>
> On less friendly data (by taking the names.list generator above
> changing 1000 to 1e7) this solution didn't finish in 22 minutes.
> Perl's also hit by the rebuilds, going from 2s to 19s.
>
> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
>
>    coclass 'dictionary'
>    okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
>    ok=: ] [ [: assert [: *./ e.&okchar
>    intern=: [: ('z' , ok)&.> boxxopen
>    has=: _1 < nc@intern
>    set=: 4 :'(intern x)=: y'
>    get=: ".@>@intern
>
>    words=: conew 'dictionary'
>    keys=:0$<''
>    3 : 0''
>      for_w. 'b' fread 'names.list' do.
>        if. has__words >w do.
>          (>w) set__words 1+get__words (>w)
>        else.
>          (>w) set__words 1
>          keys =: keys , w
>        end.
>      end.
>    )
>    values=:get__words each keys
>    echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
>
> (1.5s, 450 MB) a jd solution:
>
>    require 'data/jd'
>    words=:'m' fread 'names.list'
>    jdadminnew'tmp'
>    jd'createtable t1 word byte ',":1{$words
>    jd'insert';'t1';'word';words
>    'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
> order by count desc'
>    echo (<"1 ws) ,: ;/cn
>
>
> This is a problem where, in a language with mutable hash
> tables, I'd use them and this would mostly work out well. In
> J, I feel like I'm missing some useful primitive or some APL
> technique. Is that the case?
>
> Cheers,
> Julian
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Efficient counting of occurrences?

Reply via email to