Suppose we have a file with 10 million lines, with a lot of
repeats (realistic example: a huge log with a lot of emails or
IPs to consider).

This sentence can create such a file:

  'names.list' fwrites~ a. {~ (a.i.'a') + 10 #.inv 1e7 ?@$ 1000

And here are some ways to get the top 5 occurrences in this
list, with their rough costs:

(5s, 1.3 GB) Bash oneliner:

  sort names.list| uniq -c|sort -n|tail -5

(2s, 6 MB) Perl, using a hash to count words:

  #! /usr/bin/env perl
  use strict;
  use warnings;

  my %words;
  open my $ws, '<', 'names.list' or die $!;
  while (defined (my $w = <$ws>)) {
      chomp $w;
      $words{$w}++;
  }
  printf "%7d %s\n", $words{$_}, $_ for
      (sort { $words{$b} <=> $words{$a} } keys %words)[0..5];

(12s, 11 GB) J, avoiding immutable i. hash tables:

  words =: s:<;._2 fread 'names.list'
  uniq  =: ~. words
  count =: +/"1 uniq ="0 1 words
  top =: ((#count) - >:i.5) ({ /:) count
  echo (5 s: top { uniq) ,: <"0 top { count

(9s, 1.5 GB) J, trying to use immutable i. hash tables:

  words =: 0$<''
  count =: 0$0
  search =: words&i.
  3 : 0''
    for_w. 'b' fread 'names.list' do.
      i =. search w
      if. i = #words do.
        words =: words , w
        search =: words&i.
        count =: count , 1
      else.
        count =: (>:i{count) i } count
      end.
    end.
  )
  top =: ((#count) - >:i.5) ({ /:) count
  echo (top { words) ,: <"0 top { count

This only has to rebuild the hash table 1e3 times despite
processing 1e7 lines, so 'search' does very well.

On less friendly data (by taking the names.list generator above
changing 1000 to 1e7) this solution didn't finish in 22 minutes.
Perl's also hit by the rebuilds, going from 2s to 19s.

(42s, 1.5 GB) J, trying a mutable hash table from RosettaCode:

  coclass 'dictionary'
  okchar=:~. (,toupper) '0123456789abcdefghijklmnopqrstuz'
  ok=: ] [ [: assert [: *./ e.&okchar
  intern=: [: ('z' , ok)&.> boxxopen
  has=: _1 < nc@intern
  set=: 4 :'(intern x)=: y'
  get=: ".@>@intern

  words=: conew 'dictionary'
  keys=:0$<''
  3 : 0''
    for_w. 'b' fread 'names.list' do.
      if. has__words >w do.
        (>w) set__words 1+get__words (>w)
      else.
        (>w) set__words 1
        keys =: keys , w
      end.
    end.
  )
  values=:get__words each keys
  echo echo (5 {. keys \: values) ,: (5 {. \:~ values)

(1.5s, 450 MB) a jd solution:

  require 'data/jd'
  words=:'m' fread 'names.list'
  jdadminnew'tmp'
  jd'createtable t1 word byte ',":1{$words
  jd'insert';'t1';'word';words
'ws cn' =: 5&{. each 1{"1 jd'read count:count word by word from t1 order by count desc'
  echo (<"1 ws) ,: ;/cn


This is a problem where, in a language with mutable hash
tables, I'd use them and this would mostly work out well. In
J, I feel like I'm missing some useful primitive or some APL
technique. Is that the case?

Cheers,
Julian
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to