I did give that a try, at a cost of 9.8s and 1.4 GB:

  last   =: _1
  keys   =: 0$0
  counts =: 0$0
  3 : 0''
    for_w. 'b' fread 'names.list' do.
      s =. s: w
      if. last < 6 s: s do.
        last   =: 6 s: s
        keys   =: keys , s
        counts =: counts , 1
        first  =. 6 s: {.keys
      else.
        i =. (6 s: s) - first
        counts =: (>:i{counts) i } counts
      end.
    end.
  )
  echo (5 s: 5 {. keys \: counts) ,: (;/5 {. \:~ counts)
  exit 0

this doesn't use sparse tables, but exploits 6&s: returning
sequential indexes of sequentially interned symbols.

I don't have a good example of a need for hash tables anymore
since /. handles this case so well, but all in all I think
I'd rather get them from FFI than abuse locales or symbols.

On 2020-09-15 13:11, Michal Wallace wrote:
I think it might be nice if J had primitive hash tables available, perhaps
as a library or set of foreign words.

However, if you're using stock J and you need a hash table right now, you
can use the global hash table for symbols.
There is a pair of foreign words that let you load and save the state of
this hash table if you need it to be reusable,
but for this application it shouldn't be necessary.

Since you don't know how many unique strings there are until you read it,
you could use a sparse array to
represent the array of counters, or just check the size of the symbol table
periodically and see if you need
to grow your counters array.

All of the relevant symbol-table functions are in dyadic 's:'
https://www.jsoftware.com/help/dictionary/dsco.htm




On Mon, Sep 14, 2020 at 2:02 AM Raul Miller <rauldmil...@gmail.com> wrote:

Oops, sorry about that.

The _2 should be just a plain 2:

  wrds=: /:~[;._2 fread'names.list'
  b=: -.2-:/\wrds
  L=: -2-/\0,I.b,1
  o=: \: L
  words=: o{~.wrds
  counts=: o{L
  echo words ,:&(<"_1)&(5{.])counts

Thanks,

--
Raul

On Sun, Sep 13, 2020 at 6:37 PM Julian Fondren <jfond...@minimaltype.com>
wrote:
>
> This is very helpful, thank you, as it's much more efficient
> while being clearly a very different way to solve the problem.
>
> It does seem to be slightly wrong,
>
> ┌─────┬─────┬─────┬─────┬─────┐
> │bec  │cai  │acf  │eab  │aii  │
> ├─────┼─────┼─────┼─────┼─────┤
> │10355│10287│10262│10255│10248│
> └─────┴─────┴─────┴─────┴─────┘
>
> Those numbers are right but the names aren't. 'bec' only
> appears 10085 times in this file.
>
>
> On 2020-09-13 16:42, Raul Miller wrote:
> > This took slightly over two seconds for me (and 168megs memory):
> >
> >   wrds=: /:~[;._2 fread'names.list'
> >   b=: -.2-:/\wrds
> >   L=: -_2-/\0,I.b,1
> >   o=: \: L
> >   words=: o{~.wrds
> >   counts=: o{L
> >   echo words ,:&(<"_1)&(5{.])counts
> >
> > (And note that I didn't actually need b -- the bit vector which marks
> > value changes -- to be given a name, since I only used it once.)
> >
> > Generally speaking, with J and large data sets, you want to be
> > avoiding loops. Also, sorting tends to perform better than hashes on
> > large data sets. Hashes start making sense in different contexts.
> >
> > I hope this helps,
> >
> > --
> > Raul
> >
> > On Sun, Sep 13, 2020 at 5:21 PM Julian Fondren
> > <jfond...@minimaltype.com> wrote:
> >>
> >> Suppose we have a file with 10 million lines, with a lot of
> >> repeats (realistic example: a huge log with a lot of emails or
> >> IPs to consider).
> >>
> >> This sentence can create such a file:
> >>
> >>    'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000
> >>
> >> And here are some ways to get the top 5 occurrences in this
> >> list, with their rough costs:
> >>
> >> (5s, 1.3 GB) Bash oneliner:
> >>
> >>    sort names.list| uniq -c|sort -n|tail -5
> >>
> >> (2s, 6 MB) Perl, using a hash to count words:
> >>
> >>    #! /usr/bin/env perl
> >>    use strict;
> >>    use warnings;
> >>
> >>    my %words;
> >>    open my $ws, '<', 'names.list' or die $!;
> >>    while (defined (my $w = <$ws>)) {
> >>        chomp $w;
> >>        $words{$w}++;
> >>    }
> >>    printf "%7d %s\n", $words{$_}, $_ for
> >>        (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];
> >>
> >> (12s, 11 GB) J, avoiding immutable i. hash tables:
> >>
> >>    words =: s:<;._2 fread 'names.list'
> >>    uniq  =: ~. words
> >>    count =: +/"1 uniq ="0 1 words
> >>    top =: ((#count) - >:i.5) ({ /:) count
> >>    echo (5 s: top { uniq) ,: <"0 top { count
> >>
> >> (9s, 1.5 GB) J, trying to use immutable i. hash tables:
> >>
> >>    words =: 0$<''
> >>    count =: 0$0
> >>    search =: words&i.
> >>    3 : 0''
> >>      for_w. 'b' fread 'names.list' do.
> >>        i =. search w
> >>        if. i = #words do.
> >>          words =: words , w
> >>          search =: words&i.
> >>          count =: count , 1
> >>        else.
> >>          count =: (>:i{count) i } count
> >>        end.
> >>      end.
> >>    )
> >>    top =: ((#count) - >:i.5) ({ /:) count
> >>    echo (top { words) ,: <"0 top { count
> >>
> >> This only has to rebuild the hash table 1e3 times despite
> >> processing 1e7 lines, so 'search' does very well.
> >>
> >> On less friendly data (by taking the names.list generator above
> >> changing 1000 to 1e7) this solution didn't finish in 22 minutes.
> >> Perl's also hit by the rebuilds, going from 2s to 19s.
> >>
> >> (42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:
> >>
> >>    coclass 'dictionary'
> >>    okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
> >>    ok=: ] [ [: assert [: *./ e.&okchar
> >>    intern=: [: ('z' , ok)&.> boxxopen
> >>    has=: _1 < nc@intern
> >>    set=: 4 :'(intern x)=: y'
> >>    get=: ".@>@intern
> >>
> >>    words=: conew 'dictionary'
> >>    keys=:0$<''
> >>    3 : 0''
> >>      for_w. 'b' fread 'names.list' do.
> >>        if. has__words >w do.
> >>          (>w) set__words 1+get__words (>w)
> >>        else.
> >>          (>w) set__words 1
> >>          keys =: keys , w
> >>        end.
> >>      end.
> >>    )
> >>    values=:get__words each keys
> >>    echo echo (5 {. keys \: values) ,: (5 {. \:~ values)
> >>
> >> (1.5s, 450 MB) a jd solution:
> >>
> >>    require 'data/jd'
> >>    words=:'m' fread 'names.list'
> >>    jdadminnew'tmp'
> >>    jd'createtable t1 word byte ',":1{$words
> >>    jd'insert';'t1';'word';words
> >>    'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1
> >> order by count desc'
> >>    echo (<"1 ws) ,: ;/cn
> >>
> >>
> >> This is a problem where, in a language with mutable hash
> >> tables, I'd use them and this would mostly work out well. In
> >> J, I feel like I'm missing some useful primitive or some APL
> >> technique. Is that the case?
> >>
> >> Cheers,
> >> Julian
> >> ----------------------------------------------------------------------
> >> For information about J forums see
http://www.jsoftware.com/forums.htm
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to