This took slightly over two seconds for me (and 168 MB of memory):

   wrds=: /:~[;._2 fread'names.list'
   b=: -.2-:/\wrds
   L=: -_2-/\0,I.b,1
   o=: \: L
   words=: o{~.wrds
   counts=: o{L
   echo words ,:&(<"_1)&(5{.])counts
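To make the approach concrete, here is the same sort-and-count-runs idea walked through on a toy vector (the run-length step below is spelled a little differently from the solution above, but the technique is the same: mark where the sorted data changes value, then difference those positions):

   y=: /:~ 'baccabbab'            NB. sort, so equal items form adjacent runs
   y
aaabbbbcc
   b=: -.2-:/\y                   NB. 1 at each position whose next item differs
   b
0 0 1 0 0 0 1 0
   ~.y                            NB. the distinct items, already in sorted order
abc
   2 -~/\ _1, I. b,1              NB. run lengths, as differences of successive run ends
3 4 2
   (\: 2 -~/\ _1, I. b,1) { ~.y   NB. distinct items ordered by descending count
bac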
(And note that I didn't actually need b -- the bit vector which marks value
changes -- to be given a name, since I only used it once.)

Generally speaking, with J and large data sets, you want to avoid loops.
Also, sorting tends to perform better than hashes on large data sets;
hashes start making sense in other contexts.

I hope this helps,

--
Raul

On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren <jfond...@minimaltype.com> wrote:
>
> Suppose we have a file with 10 million lines, with a lot of
> repeats (realistic example: a huge log with a lot of emails or
> IPs to consider).
>
> This sentence can create such a file:
>
>    'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
>
> And here are some ways to get the top 5 occurrences in this
> list, with their rough costs:
>
> (5s, 1.3 GB) Bash oneliner:
>
>    sort names.list | uniq -c | sort -n | tail -5
>
> (2s, 6 MB) Perl, using a hash to count words:
>
>    #! /usr/bin/env perl
>    use strict;
>    use warnings;
>
>    my %words;
>    open my $ws, '<', 'names.list' or die $!;
>    while (defined (my $w = <$ws>)) {
>        chomp $w;
>        $words{$w}++;
>    }
>    printf "%7d %s\n", $words{$_}, $_ for
>        (sort { $words{$b} <=> $words{$a} } keys %words)[0..4];
>
> (12s, 11 GB) J, avoiding immutable i. hash tables:
>
>    words =: s:<;._2 fread 'names.list'
>    uniq =: ~. words
>    count =: +/"1 uniq ="0 1 words
>    top =: ((#count) - >:i.5) ({ /:) count
>    echo (5 s: top { uniq) ,: <"0 top { count
>
> (9s, 1.5 GB) J, trying to use immutable i. hash tables:
>
>    words =: 0$<''
>    count =: 0$0
>    search =: words&i.
>    3 : 0''
>    for_w. 'b' fread 'names.list' do.
>      i =. search w
>      if. i = #words do.
>        words =: words , w
>        search =: words&i.
>        count =: count , 1
>      else.
>        count =: (>:i{count) i } count
>      end.
>    end.
>    )
>    top =: ((#count) - >:i.5) ({ /:) count
>    echo (top { words) ,: <"0 top { count
>
> This only has to rebuild the hash table 1e3 times despite
> processing 1e7 lines, so 'search' does very well.
>
> On less friendly data (made by taking the names.list generator above
> and changing 1000 to 1e7) this solution didn't finish in 22 minutes.
> Perl is also hit by the rebuilds, going from 2s to 19s.
>
> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
>
>    coclass 'dictionary'
>    okchar=: ~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
>    ok=: ] [ [: assert [: *./ e.&okchar
>    intern=: [: ('z' , ok)&.> boxxopen
>    has=: _1 < nc@intern
>    set=: 4 : '(intern x)=: y'
>    get=: ".@>@intern
>
>    words=: conew 'dictionary'
>    keys=: 0$<''
>    3 : 0''
>    for_w. 'b' fread 'names.list' do.
>      if. has__words >w do.
>        (>w) set__words 1+get__words (>w)
>      else.
>        (>w) set__words 1
>        keys =: keys , w
>      end.
>    end.
>    )
>    values=: get__words each keys
>    echo (5 {. keys \: values) ,: (5 {. \:~ values)
>
> (1.5s, 450 MB) a Jd solution:
>
>    require 'data/jd'
>    words=: 'm' fread 'names.list'
>    jdadminnew 'tmp'
>    jd 'createtable t1 word byte ',":1{$words
>    jd 'insert';'t1';'word';words
>    'ws cn' =: 5&{. each 1{"1 jd 'read count:count word by word from t1 order by count desc'
>    echo (<"1 ws) ,: ;/cn
>
> This is a problem where, in a language with mutable hash
> tables, I'd use them and this would mostly work out well. In
> J, I feel like I'm missing some useful primitive or some APL
> technique. Is that the case?
>
> Cheers,
> Julian
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm