Suppose we have a file with 10 million lines, with a lot of
repeats (realistic example: a huge log with a lot of emails or
IPs to consider).
This sentence can create such a file:
'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
And here are some ways to get the top 5 occurrences in this
list, with their rough costs:
(5s, 1.3 GB) Bash oneliner:
sort names.list| uniq -c|sort -n|tail -5
(2s, 6 MB) Perl, using a hash to count words:
#! /usr/bin/env perl
use strict;
use warnings;
my %words;
open my $ws, '<', 'names.list' or die $!;
while (defined (my $w = <$ws>)) {
chomp $w;
$words{$w}++;
}
printf "%7d %s\n", $words{$_}, $_ for
(sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
(12s, 11 GB) J, avoiding immutable i. hash tables:
words =: s:<;._2 fread 'names.list'
uniq =: ~. words
count =: +/"1 uniq ="0 1 words
top =: ((#count) - >:i.5) ({ /:) count
echo (5 s: top { uniq) ,: <"0 top { count
(9s, 1.5 GB) J, trying to use immutable i. hash tables:
words =: 0$<''
count =: 0$0
search =: words&i.
3 : 0''
for_w. 'b' fread 'names.list' do.
i =. search w
if. i = #words do.
words =: words , w
search =: words&i.
count =: count , 1
else.
count =: (>:i{count) i } count
end.
end.
)
top =: ((#count) - >:i.5) ({ /:) count
echo (top { words) ,: <"0 top { count
This only has to rebuild the hash table 1e3 times despite
processing 1e7 lines, so 'search' does very well.
On less friendly data (by taking the names.list generator above
changing 1000 to 1e7) this solution didn't finish in 22 minutes.
Perl's also hit by the rebuilds, going from 2s to 19s.
(42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
coclass 'dictionary'
okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
ok=: ] [ [: assert [: *./ e.&okchar
intern=: [: ('z' , ok)&.> boxxopen
has=: _1 < nc@intern
set=: 4 :'(intern x)=: y'
get=: ".@>@intern
words=: conew 'dictionary'
keys=:0$<''
3 : 0''
for_w. 'b' fread 'names.list' do.
if. has__words >w do.
(>w) set__words 1+get__words (>w)
else.
(>w) set__words 1
keys =: keys , w
end.
end.
)
values=:get__words each keys
echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
(1.5s, 450 MB) a jd solution:
require 'data/jd'
words=:'m' fread 'names.list'
jdadminnew'tmp'
jd'createtable t1 word byte ',":1{$words
jd'insert';'t1';'word';words
'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
order by count desc'
echo (<"1 ws) ,: ;/cn
This is a problem where, in a language with mutable hash
tables, I'd use them and this would mostly work out well. In
J, I feel like I'm missing some useful primitive or some APL
technique. Is that the case?
Cheers,
Julian
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm