This is very helpful, thank you, as it's much more efficient while being clearly a very different way to solve the problem.
It does seem to be slightly wrong, ┌─────┬─────┬─────┬─────┬─────┐ │bec │cai │acf │eab │aii │ ├─────┼─────┼─────┼─────┼─────┤ │10355│10287│10262│10255│10248│ └─────┴─────┴─────┴─────┴─────┘ Those numbers are right but the names aren't. 'bec' only appears 10085 times in this file. On 2020-09-13 16:42, Raul Miller wrote:
This took slightly over two seconds for me (and 168megs memory): wrds=: /:~[;._2 fread'names.list' b=: -.2-:/\wrds L=: -_2-/\0,I.b,1 o=: \: L words=: o{~.wrds counts=: o{L echo words ,:&(<"_1)&(5{.])counts (And note that I didn't actually need b -- the bit vector which marks value changes -- to be given a name, since I only used it once.) Generally speaking, with J and large data sets, you want to be avoiding loops. Also, sorting tends to perform better than hashes on large data sets. Hashes start making sense in different contexts. I hope this helps, -- RaulOn Sun, Sep 13, 2020 at 5:21 PM Julian Fondren <[email protected]> wrote:Suppose we have a file with 10 million lines, with a lot of repeats (realistic example: a huge log with a lot of emails or IPs to consider). This sentence can create such a file: 'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000 And here are some ways to get the top 5 occurrences in this list, with their rough costs: (5s, 1.3 GB) Bash oneliner: sort names.list| uniq -c|sort -n|tail -5 (2s, 6 MB) Perl, using a hash to count words: #! /usr/bin/env perl use strict; use warnings; my %words; open my $ws, '<', 'names.list' or die $!; while (defined (my $w = <$ws>)) { chomp $w; $words{$w}++; } printf "%7d %s\n", $words{$_}, $_ for (sort { $words{$b} <=> $words{$a} } keys %words)[0..5]; (12s, 11 GB) J, avoiding immutable i. hash tables: words =: s:<;._2 fread 'names.list' uniq =: ~. words count =: +/"1 uniq ="0 1 words top =: ((#count) - >:i.5) ({ /:) count echo (5 s: top { uniq) ,: <"0 top { count (9s, 1.5 GB) J, trying to use immutable i. hash tables: words =: 0$<'' count =: 0$0 search =: words&i. 3 : 0'' for_w. 'b' fread 'names.list' do. i =. search w if. i = #words do. words =: words , w search =: words&i. count =: count , 1 else. count =: (>:i{count) i } count end. end. ) top =: ((#count) - >:i.5) ({ /:) count echo (top { words) ,: <"0 top { count This only has to rebuild the hash table 1e3 times despite processing 1e7 lines, so 'search' does very well. On less friendly data (by taking the names.list generator above changing 1000 to 1e7) this solution didn't finish in 22 minutes. Perl's also hit by the rebuilds, going from 2s to 19s. (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode: coclass 'dictionary' okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz' ok=: ] [ [: assert [: *./ e.&okchar intern=: [: ('z' , ok)&.> boxxopen has=: _1 < nc@intern set=: 4 :'(intern x)=: y' get=: ".@>@intern words=: conew 'dictionary' keys=:0$<'' 3 : 0'' for_w. 'b' fread 'names.list' do. if. has__words >w do. (>w) set__words 1+get__words (>w) else. (>w) set__words 1 keys =: keys , w end. end. ) values=:get__words each keys echo echo (5 {. keys \: values) ,: (5 {. \:~ values) (1.5s, 450 MB) a jd solution: require 'data/jd' words=:'m' fread 'names.list' jdadminnew'tmp' jd'createtable t1 word byte ',":1{$words jd'insert';'t1';'word';words 'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1 order by count desc' echo (<"1 ws) ,: ;/cn This is a problem where, in a language with mutable hash tables, I'd use them and this would mostly work out well. In J, I feel like I'm missing some useful primitive or some APL technique. Is that the case? Cheers, Julian ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
