You might be able to speed this up even more by using:
nos =. x I.@:E. file
in place of:
nos =. I. x E. file
When optimizing J, it's good to keep the "special code" list handy:
http://www.jsoftware.com/help/dictionary/special.htm
and if you can reformulate any of your constructs to match one of those listed,
you should (like above).
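As a quick sanity check on made-up text (txt and pat below are illustrative names, not from your code), the two forms should return the same indices; only the speed should differ:

```j
NB. Made-up sample text; txt and pat are illustrative names only.
txt =: 'abc rhost=1.2.3.4 def rhost=5.6.7.8 '
pat =: ' rhost='

a =: I. pat E. txt       NB. build the boolean mask, then take its indices
b =: pat I.@:E. txt      NB. same indices via the I.@:E. form
a -: b                   NB. 1 if the two results match
a                        NB. 3 21
```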
Also, the expression (n i."1 (' ')){."0 1 (n) works too hard. Since all J
arrays are rectangular, in the end this expression will produce a rectangular
array, whose width is the length of the longest IP. If you're not going to box
the IPs to retain their heterogeneous lengths, then it's better to calculate
this directly, as in n {.~ _ , >./ n i."1 ' ' .
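A minimal sketch of the comparison, on a made-up table (not your data). It assumes each row holds nothing but blanks after its first space; where a row has other text there, the per-row take blanks it out and the single take keeps it, so the two results can differ:

```j
NB. Made-up 3x15 table of blank-padded IPs.
n =: 15 {."1 > '1.2.3.4' ; '10.20.30.40' ; '192.168.1.100'

perrow =: (n i."1 ' ') {."0 1 n          NB. one take per row, then re-pad
direct =: n {.~ _ , >./ n i."1 ' '       NB. one take for the whole table

perrow -: direct                         NB. 1 under the assumption above
$ direct                                 NB. 3 13
```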
Finally, it might be interesting to time the nub itself. Taken all together, I
might rewrite your code along these lines:
require 'jmf'
NB. Extract IPs
ip =: ] {~ I.@:E. +/ '255.255.255.255' (+i.)&#~ [
NB. Clean & nub IPs
nub =: ~.@:({.~ _ , >./@:(i."1&' '))
NB. Fetch data, extract IPs
fetch =: dyad define
NB. Mapped noun name
mnn =. 'file'
JCHAR map_jmf_ mnn;y
IPs =. x ip mnn~
unmap_jmf_ mnn
NB. Could avoid assigning IPs (which we don't use
NB. except to return a result):
NB. (unmap_jmf_ mnn) ] x ip mnn~
IPs
)
test =: verb define
fn =. jpath '~temp\auth2.log' [ '/media/KINGSTON/logParse/messages.2'
txt =. ' rhost='
readt =. 6!:2 'IPs =. txt fetch fn'
nubt =. 6!:2 'IPs =. nub IPs'
smoutput ''
smoutput 'Read file and extract IPs.....', 's',~6j2 ": readt
smoutput 'Clean and nub list............', 's',~6j2 ": nubt
smoutput ''
smoutput 'Unique IPs:'
smoutput '-----------'
NB. Assigned locally within timed expressions
IPs
)
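To see what ip returns on its own, here is a sketch on a made-up log string (sample is an illustrative name). It assumes at least 15 characters follow every match, since ip indexes that far past each hit; a match too close to the end of the text should signal an index error:

```j
ip =: ] {~ I.@:E. +/ '255.255.255.255' (+i.)&#~ [

NB. Made-up sample: two log lines.
sample =: 'a rhost=1.2.3.4 user=root', LF, 'b rhost=10.20.30.40 user=bob', LF

r =: ' rhost=' ip sample
$ r        NB. 2 15: one row per match, 15 characters each
r          NB. rows are '1.2.3.4 user=ro' and '10.20.30.40 use'
```

Each row is the 15 characters following the search string, so whatever trails a short IP comes along with it.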
Raul wrote:
> You can get some idea of the different, on your own machine, by
> timing 1!:1 on that file.
I tried this, and it added between 0.03 and 0.05 seconds to the total time (on
a 38MB file I generated from data I found via Google, searching on
[ "rhost= " filename:.log ]). Granted, that represents between 33% and 45% of
the total time, but in absolute terms it's not much. If Robert's data doubled
in size, the difference would still be less than a tenth of a second.
Given that, I would prefer plain old fread. Mapped files introduce
complexities and subtleties which aren't worth dealing with to gain a tenth of
a second. For example, because mapped files have side effects which require
cleanup (i.e. unmap_jmf_), you can't really use them in the functional
data-passing manner which is so common and comfortable in J (and which makes
verb composition so easy).
The current exercise provides an illustration of this problem. I could write
the entire solution as nub @: ip @: fread . But I can't do that with mapped
files. This was a stumbling block when I wanted to compare the timings of
fread vs. mapped files. That is, I wanted to time file reading, IP extraction,
and scrubbing & nubbing independently.
With nub @: ip @: fread , this is very easy:
6!:2 'IPs =. fread filename'
6!:2 'IPs =. ip IPs'
6!:2 'IPs =. nub IPs'
But with the way fetch is written (above), I can't do it. File reading
and IP extraction are tangled up. The only way to do it is to rewrite the code
and put the unmap_jmf_ "somewhere else" [1]. But logically, it belongs in
fetch, packaged up with the rest of the code related to the file [2].
This is not a knock against mapped files; it's just a characteristic of
side effects in general. And I'm probably a bigger fan of J's functional
capabilities than most (and many "real" applications will have state and
side effects which require cleanup at "the end" anyway, so mapped files won't
add too much overhead).
-Dan
[1] So my options are:
(A) Leave the code entangled. Unsavory. Un-J-like.
(B) Disentangle the code, and put the unmap at the end
of the data flow, e.g.
([ unmap_jmf_ bind 'FILE') @: nub @: ip @: (3 : ('JCHAR map_jmf_ ''FILE'';y';'FILE'))
But this is ugly and makes maintenance harder. The verb
is harder to read, as the last operation is unrelated to
the function of the verb (usually the last operation tells
you something important about the verb).
Plus the final reference to the file is disjoint from all
the other references, so you have to keep more state in
your head while reading the verb (i.e. it interferes
with locality of reference).
(C) Disentangle the code, by putting the unmap somewhere it
doesn't belong (e.g. at the bottom of test ). As
above: ugly and hard to maintain.
(D) As part of the larger application, manually track all file
mappings, and clean them up at "the end". This adds
complexity, but is the currently recommended option.
(E) As part of the larger application, blindly call unmapall_jmf_
at "the end". Bad idea, even JSoftware discourages it.
(F) Forget about the side effects, and hope J properly unmaps
files when it shuts down. Maybe the best current option.
(G) Use dangerous and unsavory hacks to emulate normal functional
data flow:
NB.!! Hope J unmaps the (anonymous) data when
NB.!! it finally goes out of scope.
mfread =: verb define
JCHAR map_jmf_ y;~mnn=.'FILE_base_'
y =. mnn~
erase mnn
mappings_jmf_=:}:mappings_jmf_
y
)
IPs =: nub @: ip @: mfread
... and I don't like any of them.
[2] And even within fetch, using mapped files forced me to write superfluous
code (i.e. the assignment), because I need to unmap the file before the verb
returns, but the last executed line has to be IPs (i.e. not unmap_jmf_).
The only workaround depends on order of execution, a no-no.