Perhaps keying before grading fits the bill?
'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
40000000
;:^:_1 ":&.> 5 {. \:~ (#;{.)/.~ ];._2 (1!:1) < 'names.list'
10326 ffg
10315 eaf
10313 bcd
10290 did
10268 ffa
timespacex '5 {. \:~ (#;{.)/.~ ];._2 (1!:1) < ''names.list'''
0.28896 3.69102e8
Apologies if I've misunderstood the task!
On 9/13/20, Raul Miller <[email protected]> wrote:
> This took slightly over two seconds for me (and 168megs memory):
>
> wrds=: /:~[;._2 fread'names.list'
> b=: -.2-:/\wrds
> L=: -_2-/\0,I.b,1
> o=: \: L
> words=: o{~.wrds
> counts=: o{L
> echo words ,:&(<"_1)&(5{.])counts
>
> (And note that I didn't actually need b -- the bit vector which marks
> value changes -- to be given a name, since I only used it once.)
>
> Generally speaking, with J and large data sets, you want to be
> avoiding loops. Also, sorting tends to perform better than hashes on
> large data sets. Hashes start making sense in different contexts.
>
> I hope this helps,
>
> --
> Raul
>
> On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren <[email protected]>
> wrote:
>>
>> Suppose we have a file with 10 million lines, with a lot of
>> repeats (realistic example: a huge log with a lot of emails or
>> IPs to consider).
>>
>> This sentence can create such a file:
>>
>> 'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
>>
>> And here are some ways to get the top 5 occurrences in this
>> list, with their rough costs:
>>
>> (5s, 1.3 GB) Bash oneliner:
>>
>> sort names.list| uniq -c|sort -n|tail -5
>>
>> (2s, 6 MB) Perl, using a hash to count words:
>>
>> #! /usr/bin/env perl
>> use strict;
>> use warnings;
>>
>> my %words;
>> open my $ws, '<', 'names.list' or die $!;
>> while (defined (my $w = <$ws>)) {
>> chomp $w;
>> $words{$w}++;
>> }
>> printf "%7d %s\n", $words{$_}, $_ for
>> (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
>>
>> (12s, 11 GB) J, avoiding immutable i. hash tables:
>>
>> words =: s:<;._2 fread 'names.list'
>> uniq =: ~. words
>> count =: +/"1 uniq ="0 1 words
>> top =: ((#count) - >:i.5) ({ /:) count
>> echo (5 s: top { uniq) ,: <"0 top { count
>>
>> (9s, 1.5 GB) J, trying to use immutable i. hash tables:
>>
>> words =: 0$<''
>> count =: 0$0
>> search =: words&i.
>> 3 : 0''
>> for_w. 'b' fread 'names.list' do.
>> i =. search w
>> if. i = #words do.
>> words =: words , w
>> search =: words&i.
>> count =: count , 1
>> else.
>> count =: (>:i{count) i } count
>> end.
>> end.
>> )
>> top =: ((#count) - >:i.5) ({ /:) count
>> echo (top { words) ,: <"0 top { count
>>
>> This only has to rebuild the hash table 1e3 times despite
>> processing 1e7 lines, so 'search' does very well.
>>
>> On less friendly data (by taking the names.list generator above
>> changing 1000 to 1e7) this solution didn't finish in 22 minutes.
>> Perl's also hit by the rebuilds, going from 2s to 19s.
>>
>> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
>>
>> coclass 'dictionary'
>> okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
>> ok=: ] [ [: assert [: *./ e.&okchar
>> intern=: [: ('z' , ok)&.> boxxopen
>> has=: _1 < nc@intern
>> set=: 4 :'(intern x)=: y'
>> get=: ".@>@intern
>>
>> words=: conew 'dictionary'
>> keys=:0$<''
>> 3 : 0''
>> for_w. 'b' fread 'names.list' do.
>> if. has__words >w do.
>> (>w) set__words 1+get__words (>w)
>> else.
>> (>w) set__words 1
>> keys =: keys , w
>> end.
>> end.
>> )
>> values=:get__words each keys
>> echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
>>
>> (1.5s, 450 MB) a jd solution:
>>
>> require 'data/jd'
>> words=:'m' fread 'names.list'
>> jdadminnew'tmp'
>> jd'createtable t1 word byte ',":1{$words
>> jd'insert';'t1';'word';words
>> 'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
>> order by count desc'
>> echo (<"1 ws) ,: ;/cn
>>
>>
>> This is a problem where, in a language with mutable hash
>> tables, I'd use them and this would mostly work out well. In
>> J, I feel like I'm missing some useful primitive or some APL
>> technique. Is that the case?
>>
>> Cheers,
>> Julian
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm