Re: [Jprogramming] Efficient counting of occurrences?

Raul Miller Sun, 13 Sep 2020 23:03:18 -0700

Oops, sorry about that.

The _2 should be just a plain 2:


  wrds=: /:~[;._2 fread'names.list'
  b=: -.2-:/\wrds
  L=: -2-/\0,I.b,1
  o=: \: L
  words=: o{~.wrds
  counts=: o{L
  echo words ,:&(<"_1)&(5{.])counts

Thanks,

-- 
Raul

On Sun, Sep 13, 2020 at 6:37 PM Julian Fondren <[email protected]> wrote:
>
> This is very helpful, thank you, as it's much more efficient
> while being clearly a very different way to solve the problem.
>
> It does seem to be slightly wrong,
>
> ┌─────┬─────┬─────┬─────┬─────┐
> │bec  │cai  │acf  │eab  │aii  │
> ├─────┼─────┼─────┼─────┼─────┤
> │10355│10287│10262│10255│10248│
> └─────┴─────┴─────┴─────┴─────┘
>
> Those numbers are right but the names aren't. 'bec' only
> appears 10085 times in this file.
>
>
> On 2020-09-13 16:42, Raul Miller wrote:
> > This took slightly over two seconds for me (and 168megs memory):
> >
> >   wrds=: /:~[;._2 fread'names.list'
> >   b=: -.2-:/\wrds
> >   L=: -_2-/\0,I.b,1
> >   o=: \: L
> >   words=: o{~.wrds
> >   counts=: o{L
> >   echo words ,:&(<"_1)&(5{.])counts
> >
> > (And note that I didn't actually need b -- the bit vector which marks
> > value changes -- to be given a name, since I only used it once.)
> >
> > Generally speaking, with J and large data sets, you want to be
> > avoiding loops. Also, sorting tends to perform better than hashes on
> > large data sets. Hashes start making sense in different contexts.
> >
> > I hope this helps,
> >
> > --
> > Raul
> >
> > On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren
> > <[email protected]> wrote:
> >>
> >> Suppose we have a file with 10 million lines, with a lot of
> >> repeats (realistic example: a huge log with a lot of emails or
> >> IPs to consider).
> >>
> >> This sentence can create such a file:
> >>
> >>    'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
> >>
> >> And here are some ways to get the top 5 occurrences in this
> >> list, with their rough costs:
> >>
> >> (5s, 1.3 GB) Bash oneliner:
> >>
> >>    sort names.list| uniq -c|sort -n|tail -5
> >>
> >> (2s, 6 MB) Perl, using a hash to count words:
> >>
> >>    #! /usr/bin/env perl
> >>    use strict;
> >>    use warnings;
> >>
> >>    my %words;
> >>    open my $ws, '<', 'names.list' or die $!;
> >>    while (defined (my $w = <$ws>)) {
> >>        chomp $w;
> >>        $words{$w}++;
> >>    }
> >>    printf "%7d %s\n", $words{$_}, $_ for
> >>        (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
> >>
> >> (12s, 11 GB) J, avoiding immutable i. hash tables:
> >>
> >>    words =: s:<;._2 fread 'names.list'
> >>    uniq  =: ~. words
> >>    count =: +/"1 uniq ="0 1 words
> >>    top =: ((#count) - >:i.5) ({ /:) count
> >>    echo (5 s: top { uniq) ,: <"0 top { count
> >>
> >> (9s, 1.5 GB) J, trying to use immutable i. hash tables:
> >>
> >>    words =: 0$<''
> >>    count =: 0$0
> >>    search =: words&i.
> >>    3 : 0''
> >>      for_w. 'b' fread 'names.list' do.
> >>        i =. search w
> >>        if. i = #words do.
> >>          words =: words , w
> >>          search =: words&i.
> >>          count =: count , 1
> >>        else.
> >>          count =: (>:i{count) i } count
> >>        end.
> >>      end.
> >>    )
> >>    top =: ((#count) - >:i.5) ({ /:) count
> >>    echo (top { words) ,: <"0 top { count
> >>
> >> This only has to rebuild the hash table 1e3 times despite
> >> processing 1e7 lines, so 'search' does very well.
> >>
> >> On less friendly data (by taking the names.list generator above
> >> changing 1000 to 1e7) this solution didn't finish in 22 minutes.
> >> Perl's also hit by the rebuilds, going from 2s to 19s.
> >>
> >> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
> >>
> >>    coclass 'dictionary'
> >>    okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
> >>    ok=: ] [ [: assert [: *./ e.&okchar
> >>    intern=: [: ('z' , ok)&.> boxxopen
> >>    has=: _1 < nc@intern
> >>    set=: 4 :'(intern x)=: y'
> >>    get=: ".@>@intern
> >>
> >>    words=: conew 'dictionary'
> >>    keys=:0$<''
> >>    3 : 0''
> >>      for_w. 'b' fread 'names.list' do.
> >>        if. has__words >w do.
> >>          (>w) set__words 1+get__words (>w)
> >>        else.
> >>          (>w) set__words 1
> >>          keys =: keys , w
> >>        end.
> >>      end.
> >>    )
> >>    values=:get__words each keys
> >>    echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
> >>
> >> (1.5s, 450 MB) a jd solution:
> >>
> >>    require 'data/jd'
> >>    words=:'m' fread 'names.list'
> >>    jdadminnew'tmp'
> >>    jd'createtable t1 word byte ',":1{$words
> >>    jd'insert';'t1';'word';words
> >>    'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
> >> order by count desc'
> >>    echo (<"1 ws) ,: ;/cn
> >>
> >>
> >> This is a problem where, in a language with mutable hash
> >> tables, I'd use them and this would mostly work out well. In
> >> J, I feel like I'm missing some useful primitive or some APL
> >> technique. Is that the case?
> >>
> >> Cheers,
> >> Julian
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Efficient counting of occurrences?

Reply via email to