Re: [Jprogramming] Efficient counting of occurrences?

Julian Fondren Sun, 13 Sep 2020 15:38:18 -0700

This is very helpful, thank you, as it's much more efficient
while being clearly a very different way to solve the problem.


It does seem to be slightly wrong,

┌─────┬─────┬─────┬─────┬─────┐
│bec  │cai  │acf  │eab  │aii  │
├─────┼─────┼─────┼─────┼─────┤
│10355│10287│10262│10255│10248│
└─────┴─────┴─────┴─────┴─────┘

Those numbers are right but the names aren't. 'bec' only
appears 10085 times in this file.


On 2020-09-13 16:42, Raul Miller wrote:

This took slightly over two seconds for me (and 168megs memory):

  wrds=: /:~[;._2 fread'names.list'
  b=: -.2-:/\wrds
  L=: -_2-/\0,I.b,1
  o=: \: L
  words=: o{~.wrds
  counts=: o{L
  echo words ,:&(<"_1)&(5{.])counts

(And note that I didn't actually need b -- the bit vector which marks
value changes -- to be given a name, since I only used it once.)

Generally speaking, with J and large data sets, you want to be
avoiding loops. Also, sorting tends to perform better than hashes on
large data sets. Hashes start making sense in different contexts.

I hope this helps,

--
Raul

On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren<[email protected]> wrote:


Suppose we have a file with 10 million lines, with a lot of
repeats (realistic example: a huge log with a lot of emails or
IPs to consider).

This sentence can create such a file:

   'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000

And here are some ways to get the top 5 occurrences in this
list, with their rough costs:

(5s, 1.3 GB) Bash oneliner:

   sort names.list| uniq -c|sort -n|tail -5

(2s, 6 MB) Perl, using a hash to count words:

   #! /usr/bin/env perl
   use strict;
   use warnings;

   my %words;
   open my $ws, '<', 'names.list' or die $!;
   while (defined (my $w = <$ws>)) {
       chomp $w;
       $words{$w}++;
   }
   printf "%7d %s\n", $words{$_}, $_ for
       (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];

(12s, 11 GB) J, avoiding immutable i. hash tables:

   words =: s:<;._2 fread 'names.list'
   uniq  =: ~. words
   count =: +/"1 uniq ="0 1 words
   top =: ((#count) - >:i.5) ({ /:) count
   echo (5 s: top { uniq) ,: <"0 top { count

(9s, 1.5 GB) J, trying to use immutable i. hash tables:

   words =: 0$<''
   count =: 0$0
   search =: words&i.
   3 : 0''
     for_w. 'b' fread 'names.list' do.
       i =. search w
       if. i = #words do.
         words =: words , w
         search =: words&i.
         count =: count , 1
       else.
         count =: (>:i{count) i } count
       end.
     end.
   )
   top =: ((#count) - >:i.5) ({ /:) count
   echo (top { words) ,: <"0 top { count

This only has to rebuild the hash table 1e3 times despite
processing 1e7 lines, so 'search' does very well.

On less friendly data (by taking the names.list generator above
changing 1000 to 1e7) this solution didn't finish in 22 minutes.
Perl's also hit by the rebuilds, going from 2s to 19s.

(42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:

   coclass 'dictionary'
   okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
   ok=: ] [ [: assert [: *./ e.&okchar
   intern=: [: ('z' , ok)&.> boxxopen
   has=: _1 < nc@intern
   set=: 4 :'(intern x)=: y'
   get=: ".@>@intern

   words=: conew 'dictionary'
   keys=:0$<''
   3 : 0''
     for_w. 'b' fread 'names.list' do.
       if. has__words >w do.
         (>w) set__words 1+get__words (>w)
       else.
         (>w) set__words 1
         keys =: keys , w
       end.
     end.
   )
   values=:get__words each keys
   echo echo (5 {. keys \: values) ,: (5 {. \:~ values)

(1.5s, 450 MB) a jd solution:

   require 'data/jd'
   words=:'m' fread 'names.list'
   jdadminnew'tmp'
   jd'createtable t1 word byte ',":1{$words
   jd'insert';'t1';'word';words
   'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
order by count desc'
   echo (<"1 ws) ,: ;/cn


This is a problem where, in a language with mutable hash
tables, I'd use them and this would mostly work out well. In
J, I feel like I'm missing some useful primitive or some APL
technique. Is that the case?

Cheers,
Julian
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Efficient counting of occurrences?

Reply via email to