This took slightly over two seconds for me (and 168megs memory):
wrds=: /:~[;._2 fread'names.list'
b=: -.2-:/\wrds
L=: -_2-/\0,I.b,1
o=: \: L
words=: o{~.wrds
counts=: o{L
echo words ,:&(<"_1)&(5{.])counts
(And note that I didn't actually need b -- the bit vector which marks
value changes -- to be given a name, since I only used it once.)
Generally speaking, with J and large data sets, you want to be
avoiding loops. Also, sorting tends to perform better than hashes on
large data sets. Hashes start making sense in different contexts.
I hope this helps,
--
Raul
On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren <[email protected]> wrote:
>
> Suppose we have a file with 10 million lines, with a lot of
> repeats (realistic example: a huge log with a lot of emails or
> IPs to consider).
>
> This sentence can create such a file:
>
> 'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
>
> And here are some ways to get the top 5 occurrences in this
> list, with their rough costs:
>
> (5s, 1.3 GB) Bash oneliner:
>
> sort names.list| uniq -c|sort -n|tail -5
>
> (2s, 6 MB) Perl, using a hash to count words:
>
> #! /usr/bin/env perl
> use strict;
> use warnings;
>
> my %words;
> open my $ws, '<', 'names.list' or die $!;
> while (defined (my $w = <$ws>)) {
> chomp $w;
> $words{$w}++;
> }
> printf "%7d %s\n", $words{$_}, $_ for
> (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
>
> (12s, 11 GB) J, avoiding immutable i. hash tables:
>
> words =: s:<;._2 fread 'names.list'
> uniq =: ~. words
> count =: +/"1 uniq ="0 1 words
> top =: ((#count) - >:i.5) ({ /:) count
> echo (5 s: top { uniq) ,: <"0 top { count
>
> (9s, 1.5 GB) J, trying to use immutable i. hash tables:
>
> words =: 0$<''
> count =: 0$0
> search =: words&i.
> 3 : 0''
> for_w. 'b' fread 'names.list' do.
> i =. search w
> if. i = #words do.
> words =: words , w
> search =: words&i.
> count =: count , 1
> else.
> count =: (>:i{count) i } count
> end.
> end.
> )
> top =: ((#count) - >:i.5) ({ /:) count
> echo (top { words) ,: <"0 top { count
>
> This only has to rebuild the hash table 1e3 times despite
> processing 1e7 lines, so 'search' does very well.
>
> On less friendly data (by taking the names.list generator above
> changing 1000 to 1e7) this solution didn't finish in 22 minutes.
> Perl's also hit by the rebuilds, going from 2s to 19s.
>
> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
>
> coclass 'dictionary'
> okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
> ok=: ] [ [: assert [: *./ e.&okchar
> intern=: [: ('z' , ok)&.> boxxopen
> has=: _1 < nc@intern
> set=: 4 :'(intern x)=: y'
> get=: ".@>@intern
>
> words=: conew 'dictionary'
> keys=:0$<''
> 3 : 0''
> for_w. 'b' fread 'names.list' do.
> if. has__words >w do.
> (>w) set__words 1+get__words (>w)
> else.
> (>w) set__words 1
> keys =: keys , w
> end.
> end.
> )
> values=:get__words each keys
> echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
>
> (1.5s, 450 MB) a jd solution:
>
> require 'data/jd'
> words=:'m' fread 'names.list'
> jdadminnew'tmp'
> jd'createtable t1 word byte ',":1{$words
> jd'insert';'t1';'word';words
> 'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
> order by count desc'
> echo (<"1 ws) ,: ;/cn
>
>
> This is a problem where, in a language with mutable hash
> tables, I'd use them and this would mostly work out well. In
> J, I feel like I'm missing some useful primitive or some APL
> technique. Is that the case?
>
> Cheers,
> Julian
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm