We need a general-purpose read-line facility.
It is common in the C runtime and in other languages.
Although it is possible to do this in J, it's better not
to redo the low-level work every time.
Chris has shown how to do it in a way specific to
a concrete example. I suggest separating the
reading part from the processing, so that the reading
can be reused.
Here is a list of constraints:
- it's OK to assume LF line separators only (no CR)
- read every byte of the file once and only once
- process empty lines
- process a non-terminated last line
- be fast and lean
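To make the constraints concrete, here is a minimal sketch on toy data (the noun names B and L and the sample string are made up for illustration). It shows the two primitives the reader below relies on: i: finds the last LF in a block, and ;.2 cuts the complete part into lines without dropping empty ones.

```j
NB. toy block (hypothetical data): three lines, one of them empty, plus an unfinished tail
B=: 'ab',LF,LF,'cd',LF,'ef'
L=: 1 + B i: LF      NB. 7, the length of the part that ends with the last LF
# <;.2 L {. B        NB. 3 pieces: 'ab',LF then a bare LF (the empty line) then 'cd',LF
L }. B               NB. 'ef', the incomplete tail to carry over to the next block
```

Because the tail is carried over rather than re-read, each byte is fetched from the file only once.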
Here is an approach that keeps the state of
file management out of the user code by means of
a callback for each line.
It computes wc for a 1 MB file on a P2.8GHz machine in about 1.7 sec.
The result is lines, words, characters, then time (sec) and space (bytes):
   (wc FN) , ts'wc FN'
80000 200000 999999 1.6866 95808
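NB. =========================================================
NB. readlines -- line reader
NB. usage: u readlines filename
NB. u is applied to each line; a line keeps its trailing LF,
NB. except a non-terminated last line
require 'files'
SB=: 10000                 NB. block size in bytes
readlines=: 1 : 0
  assert fexist y
  S=. fsize y              NB. file size
  P=. 0                    NB. read position
  B=. ''                   NB. buffer: incomplete tail of earlier blocks
  while. P < S do.
    B=. B,fread y ; P,SR=. SB<.S-P
    P=. P+SR
    if. (#B) >: L=. 1 + B i:LF do.   NB. true when B contains an LF
      u ;.2 L {. B         NB. apply u to each complete line
      B=. L }. B           NB. keep what follows the last LF
    end.
  end.
  if. #B do. u B end.      NB. non-terminated last line
)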
NB. =========================================================
NB. user code
lwc=: 3 : 0
LC=: LC + 1
WC=: WC + #@;: }:^:(LF={:)y
CC=: CC + #y
)
wc=: 3 : 0
LC=: WC=: CC=: 0
lwc readlines y
LC , WC , CC
)
ts=: 6!:2 , 7!:2@]
A=: 20000 ((* #) $ ]) 0 : 0
one two three four five
six seven
eight nine ten
)
0 : 0
(}:A) fwrite FN=: jpath '~temp/t1.txt'
(wc FN) , ts'wc FN'
)
NB. =========================================================
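As a sketch of the reuse this is meant to allow, the same readlines could drive the csn extraction from Chris's message quoted below. The names lcsn, getcsn2 and CSN are made up for illustration, and it assumes each record looks like "csn value" at the start of a line, as in his sample data.

```j
NB. hypothetical callback: collect the value from every line starting with 'csn '
lcsn=: 3 : 0
  if. 'csn ' -: 4 {. y do.
    CSN=: CSN , < 4 }. }:^:(LF={:) y   NB. drop the trailing LF, then the 'csn ' prefix
  end.
  EMPTY                                NB. same result for every line, so ;.2 can assemble them
)

NB. hypothetical driver, same shape as wc above
getcsn2=: 3 : 0
  CSN=: ''
  lcsn readlines y
  ~. CSN                               NB. unique values, boxed
)
```

On the sample file from Chris's message, getcsn2 F should return the same two boxed values as his getcsn F. Appending to a global once per line is not fast, so this is only meant to show that the reading part needs no changes.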
--- Chris Burke <[EMAIL PROTECTED]> wrote:
> Yoel Jacobsen wrote:
> > I wrote some short sentences to parse a log file. I want to retrieve all
> > the
> > unique values of some attribute. The way it shows in the log file is
> > <attribute name>SPACE<attribute value> such as "..... csn 92892849893284
> > ..."
> >
> > My initial (brute force) program is:
> >
> > text =: 1!:1 < '/tmp/logfile'
> > words =: cutopen text
> > bv =: (<'csn') = words
> > srbv =: _1 |.!.0 bv
> > csns =: ~. srbv # words
> >
> > Now csns holds the unique values as requested.
> >
> > The program works fine for small files (few megabytes).
>
> Probably the simplest way to handle this is to read the file in large
> blocks, and chop the blocks into lines. Since lines are of uneven
> length, the blocks will likely not end in a line separator, so need to
> be truncated.
>
> You don't need to memory map the file.
>
> The following example assumes each line ends in LF:
>
> getcsn=: 3 : 0
> siz=. fsize y
> blk=. 1e7
> ptr=. 0
> res=. ''
> while. ptr < siz do.
> len=. blk <. siz - ptr
> dat=. fread y;ptr,len
> lfx=. 1 + dat i: LF
> ptr=. ptr + lfx
> dat=. <;._2 lfx {. dat
> key=. (dat i.&> ' ') {. each dat
> msk=. key = <'csn'
> res=. ~. res, msk # dat
> end.
> 4 }. each res
> )
>
> A=: 0 : 0
> abc qweqwe
> csn 1234
> def 123123
> csn 87654
> )
>
> A fwrites F=: jpath '~temp/t1.dat'
> 41
>
> getcsn F
> +----+-----+
> |1234|87654|
> +----+-----+
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>