Perhaps keying before grading fits the bill? 'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000 40000000 ;:^:_1 ":&.> 5 {. \:~ (#;{.)/.~ ];._2 (1!:1) < 'names.list' 10326 ffg 10315 eaf 10313 bcd 10290 did 10268 ffa timespacex '5 {. \:~ (#;{.)/.~ ];._2 (1!:1) < ''names.list''' 0.28896 3.69102e8
Apologies if I've misunderstood the task! On 9/13/20, Raul Miller <rauldmil...@gmail.com> wrote: > This took slightly over two seconds for me (and 168megs memory): > > wrds=: /:~[;._2 fread'names.list' > b=: -.2-:/\wrds > L=: -_2-/\0,I.b,1 > o=: \: L > words=: o{~.wrds > counts=: o{L > echo words ,:&(<"_1)&(5{.])counts > > (And note that I didn't actually need b -- the bit vector which marks > value changes -- to be given a name, since I only used it once.) > > Generally speaking, with J and large data sets, you want to be > avoiding loops. Also, sorting tends to perform better than hashes on > large data sets. Hashes start making sense in different contexts. > > I hope this helps, > > -- > Raul > > On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren <jfond...@minimaltype.com> > wrote: >> >> Suppose we have a file with 10 million lines, with a lot of >> repeats (realistic example: a huge log with a lot of emails or >> IPs to consider). >> >> This sentence can create such a file: >> >> 'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000 >> >> And here are some ways to get the top 5 occurrences in this >> list, with their rough costs: >> >> (5s, 1.3 GB) Bash oneliner: >> >> sort names.list| uniq -c|sort -n|tail -5 >> >> (2s, 6 MB) Perl, using a hash to count words: >> >> #! /usr/bin/env perl >> use strict; >> use warnings; >> >> my %words; >> open my $ws, '<', 'names.list' or die $!; >> while (defined (my $w = <$ws>)) { >> chomp $w; >> $words{$w}++; >> } >> printf "%7d %s\n", $words{$_}, $_ for >> (sort { $words{$b} <=> $words{$a} } keys %words)[0..5]; >> >> (12s, 11 GB) J, avoiding immutable i. hash tables: >> >> words =: s:<;._2 fread 'names.list' >> uniq =: ~. words >> count =: +/"1 uniq ="0 1 words >> top =: ((#count) - >:i.5) ({ /:) count >> echo (5 s: top { uniq) ,: <"0 top { count >> >> (9s, 1.5 GB) J, trying to use immutable i. hash tables: >> >> words =: 0$<'' >> count =: 0$0 >> search =: words&i. >> 3 : 0'' >> for_w. 'b' fread 'names.list' do. >> i =. search w >> if. i = #words do. >> words =: words , w >> search =: words&i. >> count =: count , 1 >> else. >> count =: (>:i{count) i } count >> end. >> end. >> ) >> top =: ((#count) - >:i.5) ({ /:) count >> echo (top { words) ,: <"0 top { count >> >> This only has to rebuild the hash table 1e3 times despite >> processing 1e7 lines, so 'search' does very well. >> >> On less friendly data (by taking the names.list generator above >> changing 1000 to 1e7) this solution didn't finish in 22 minutes. >> Perl's also hit by the rebuilds, going from 2s to 19s. >> >> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode: >> >> coclass 'dictionary' >> okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz' >> ok=: ] [ [: assert [: *./ e.&okchar >> intern=: [: ('z' , ok)&.> boxxopen >> has=: _1 < nc@intern >> set=: 4 :'(intern x)=: y' >> get=: ".@>@intern >> >> words=: conew 'dictionary' >> keys=:0$<'' >> 3 : 0'' >> for_w. 'b' fread 'names.list' do. >> if. has__words >w do. >> (>w) set__words 1+get__words (>w) >> else. >> (>w) set__words 1 >> keys =: keys , w >> end. >> end. >> ) >> values=:get__words each keys >> echo echo (5 {. keys \: values) ,: (5 {. \:~ values) >> >> (1.5s, 450 MB) a jd solution: >> >> require 'data/jd' >> words=:'m' fread 'names.list' >> jdadminnew'tmp' >> jd'createtable t1 word byte ',":1{$words >> jd'insert';'t1';'word';words >> 'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1 >> order by count desc' >> echo (<"1 ws) ,: ;/cn >> >> >> This is a problem where, in a language with mutable hash >> tables, I'd use them and this would mostly work out well. In >> J, I feel like I'm missing some useful primitive or some APL >> technique. Is that the case? >> >> Cheers, >> Julian >> ---------------------------------------------------------------------- >> For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm