No, this is exactly the task. Thank you as well!
Between these two solutions I think it's clear that
what I'm missing is a sufficient appreciation for the
table adverbs, / \ /. \.
On 2020-09-13 17:31, Joseph Novakovich wrote:
Perhaps keying before grading fits the bill?
'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
40000000
;:^:_1 ":&.> 5 {. \:~ (#;{.)/.~ ];._2 (1!:1) < 'names.list'
10326 ffg
10315 eaf
10313 bcd
10290 did
10268 ffa
timespacex '5 {. \:~ (#;{.)/.~ ];._2 (1!:1) < ''names.list'''
0.28896 3.69102e8
Apologies if I've misunderstood the task!
On 9/13/20, Raul Miller <rauldmil...@gmail.com> wrote:
This took slightly over two seconds for me (and 168megs memory):
wrds=: /:~[;._2 fread'names.list'
b=: -.2-:/\wrds
L=: -_2-/\0,I.b,1
o=: \: L
words=: o{~.wrds
counts=: o{L
echo words ,:&(<"_1)&(5{.])counts
(And note that I didn't actually need b -- the bit vector which marks
value changes -- to be given a name, since I only used it once.)
Generally speaking, with J and large data sets, you want to be
avoiding loops. Also, sorting tends to perform better than hashes on
large data sets. Hashes start making sense in different contexts.
I hope this helps,
--
Raul
On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren
<jfond...@minimaltype.com>
wrote:
Suppose we have a file with 10 million lines, with a lot of
repeats (realistic example: a huge log with a lot of emails or
IPs to consider).
This sentence can create such a file:
'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
And here are some ways to get the top 5 occurrences in this
list, with their rough costs:
(5s, 1.3 GB) Bash oneliner:
sort names.list| uniq -c|sort -n|tail -5
(2s, 6 MB) Perl, using a hash to count words:
#! /usr/bin/env perl
use strict;
use warnings;
my %words;
open my $ws, '<', 'names.list' or die $!;
while (defined (my $w = <$ws>)) {
chomp $w;
$words{$w}++;
}
printf "%7d %s\n", $words{$_}, $_ for
(sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
(12s, 11 GB) J, avoiding immutable i. hash tables:
words =: s:<;._2 fread 'names.list'
uniq =: ~. words
count =: +/"1 uniq ="0 1 words
top =: ((#count) - >:i.5) ({ /:) count
echo (5 s: top { uniq) ,: <"0 top { count
(9s, 1.5 GB) J, trying to use immutable i. hash tables:
words =: 0$<''
count =: 0$0
search =: words&i.
3 : 0''
for_w. 'b' fread 'names.list' do.
i =. search w
if. i = #words do.
words =: words , w
search =: words&i.
count =: count , 1
else.
count =: (>:i{count) i } count
end.
end.
)
top =: ((#count) - >:i.5) ({ /:) count
echo (top { words) ,: <"0 top { count
This only has to rebuild the hash table 1e3 times despite
processing 1e7 lines, so 'search' does very well.
On less friendly data (by taking the names.list generator above
changing 1000 to 1e7) this solution didn't finish in 22 minutes.
Perl's also hit by the rebuilds, going from 2s to 19s.
(42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
coclass 'dictionary'
okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
ok=: ] [ [: assert [: *./ e.&okchar
intern=: [: ('z' , ok)&.> boxxopen
has=: _1 < nc@intern
set=: 4 :'(intern x)=: y'
get=: ".@>@intern
words=: conew 'dictionary'
keys=:0$<''
3 : 0''
for_w. 'b' fread 'names.list' do.
if. has__words >w do.
(>w) set__words 1+get__words (>w)
else.
(>w) set__words 1
keys =: keys , w
end.
end.
)
values=:get__words each keys
echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
(1.5s, 450 MB) a jd solution:
require 'data/jd'
words=:'m' fread 'names.list'
jdadminnew'tmp'
jd'createtable t1 word byte ',":1{$words
jd'insert';'t1';'word';words
'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
order by count desc'
echo (<"1 ws) ,: ;/cn
This is a problem where, in a language with mutable hash
tables, I'd use them and this would mostly work out well. In
J, I feel like I'm missing some useful primitive or some APL
technique. Is that the case?
Cheers,
Julian
----------------------------------------------------------------------
For information about J forums see
http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm