Oops, sorry about that.
The _2 should be just a plain 2:
wrds=: /:~[;._2 fread'names.list'
b=: -.2-:/\wrds
L=: -2-/\0,I.b,1
o=: \: L
words=: o{~.wrds
counts=: o{L
echo words ,:&(<"_1)&(5{.])counts
Thanks,
--
Raul
On Sun, Sep 13, 2020 at 6:37 PM Julian Fondren
<jfond...@minimaltype.com>
wrote:
>
> This is very helpful, thank you, as it's much more efficient
> while being clearly a very different way to solve the problem.
>
> It does seem to be slightly wrong,
>
> ┌─────┬─────┬─────┬─────┬─────┐
> │bec │cai │acf │eab │aii │
> ├─────┼─────┼─────┼─────┼─────┤
> │10355│10287│10262│10255│10248│
> └─────┴─────┴─────┴─────┴─────┘
>
> Those numbers are right but the names aren't. 'bec' only
> appears 10085 times in this file.
>
>
> On 2020-09-13 16:42, Raul Miller wrote:
> > This took slightly over two seconds for me (and 168megs memory):
> >
> > wrds=: /:~[;._2 fread'names.list'
> > b=: -.2-:/\wrds
> > L=: -_2-/\0,I.b,1
> > o=: \: L
> > words=: o{~.wrds
> > counts=: o{L
> > echo words ,:&(<"_1)&(5{.])counts
> >
> > (And note that I didn't actually need b -- the bit vector which marks
> > value changes -- to be given a name, since I only used it once.)
> >
> > Generally speaking, with J and large data sets, you want to be
> > avoiding loops. Also, sorting tends to perform better than hashes on
> > large data sets. Hashes start making sense in different contexts.
> >
> > I hope this helps,
> >
> > --
> > Raul
> >
> > On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren
> > <jfond...@minimaltype.com> wrote:
> >>
> >> Suppose we have a file with 10 million lines, with a lot of
> >> repeats (realistic example: a huge log with a lot of emails or
> >> IPs to consider).
> >>
> >> This sentence can create such a file:
> >>
> >> 'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
> >>
> >> And here are some ways to get the top 5 occurrences in this
> >> list, with their rough costs:
> >>
> >> (5s, 1.3 GB) Bash oneliner:
> >>
> >> sort names.list| uniq -c|sort -n|tail -5
> >>
> >> (2s, 6 MB) Perl, using a hash to count words:
> >>
> >> #! /usr/bin/env perl
> >> use strict;
> >> use warnings;
> >>
> >> my %words;
> >> open my $ws, '<', 'names.list' or die $!;
> >> while (defined (my $w = <$ws>)) {
> >> chomp $w;
> >> $words{$w}++;
> >> }
> >> printf "%7d %s\n", $words{$_}, $_ for
> >> (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
> >>
> >> (12s, 11 GB) J, avoiding immutable i. hash tables:
> >>
> >> words =: s:<;._2 fread 'names.list'
> >> uniq =: ~. words
> >> count =: +/"1 uniq ="0 1 words
> >> top =: ((#count) - >:i.5) ({ /:) count
> >> echo (5 s: top { uniq) ,: <"0 top { count
> >>
> >> (9s, 1.5 GB) J, trying to use immutable i. hash tables:
> >>
> >> words =: 0$<''
> >> count =: 0$0
> >> search =: words&i.
> >> 3 : 0''
> >> for_w. 'b' fread 'names.list' do.
> >> i =. search w
> >> if. i = #words do.
> >> words =: words , w
> >> search =: words&i.
> >> count =: count , 1
> >> else.
> >> count =: (>:i{count) i } count
> >> end.
> >> end.
> >> )
> >> top =: ((#count) - >:i.5) ({ /:) count
> >> echo (top { words) ,: <"0 top { count
> >>
> >> This only has to rebuild the hash table 1e3 times despite
> >> processing 1e7 lines, so 'search' does very well.
> >>
> >> On less friendly data (by taking the names.list generator above
> >> changing 1000 to 1e7) this solution didn't finish in 22 minutes.
> >> Perl's also hit by the rebuilds, going from 2s to 19s.
> >>
> >> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
> >>
> >> coclass 'dictionary'
> >> okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
> >> ok=: ] [ [: assert [: *./ e.&okchar
> >> intern=: [: ('z' , ok)&.> boxxopen
> >> has=: _1 < nc@intern
> >> set=: 4 :'(intern x)=: y'
> >> get=: ".@>@intern
> >>
> >> words=: conew 'dictionary'
> >> keys=:0$<''
> >> 3 : 0''
> >> for_w. 'b' fread 'names.list' do.
> >> if. has__words >w do.
> >> (>w) set__words 1+get__words (>w)
> >> else.
> >> (>w) set__words 1
> >> keys =: keys , w
> >> end.
> >> end.
> >> )
> >> values=:get__words each keys
> >> echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
> >>
> >> (1.5s, 450 MB) a jd solution:
> >>
> >> require 'data/jd'
> >> words=:'m' fread 'names.list'
> >> jdadminnew'tmp'
> >> jd'createtable t1 word byte ',":1{$words
> >> jd'insert';'t1';'word';words
> >> 'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
> >> order by count desc'
> >> echo (<"1 ws) ,: ;/cn
> >>
> >>
> >> This is a problem where, in a language with mutable hash
> >> tables, I'd use them and this would mostly work out well. In
> >> J, I feel like I'm missing some useful primitive or some APL
> >> technique. Is that the case?
> >>
> >> Cheers,
> >> Julian
> >> ----------------------------------------------------------------------
> >> For information about J forums see
http://www.jsoftware.com/forums.htm
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm