Re: [Jprogramming] Scanning a large file

Oleg Kobchenko Sun, 14 May 2006 19:12:10 -0700

We need general-purpose read-line functionality.
It is common in the C runtime and in other languages.
Although it is possible to do this in J, it is better not
to redo the low-level work every time.


Chris has shown how to do it in a way specific to
a concrete example. I suggest separating the
reading part from the processing, so that the reading
can be reused.

Here is a list of constraints:
 - it's OK to assume LF line separators only (no CR)
 - read every byte of the file once and only once
 - process empty lines
 - process a non-terminated last line
 - be fast and lean
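
The constraints on empty lines and a non-terminated last line can be
tried on a small in-memory buffer before touching any file. This is
only a sketch: the names buf and len and the sample text are made up
for illustration.

```j
NB. sketch only: buf and len are hypothetical names, the text is made up
buf=: 'one two',LF,LF,'six',LF,'eig'   NB. a block that ends in mid-line
len=: 1 + buf i: LF                    NB. 13: count up to and including the last LF
# ;.2 len {. buf                       NB. 8 1 4: line lengths with LF; the empty line is kept
len }. buf                             NB. 'eig': partial line, to be joined with the next block
```

Cutting with ;.2 keeps the LF in each piece, so an empty line still
arrives as a one-character line and is not lost.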

Here is an approach that keeps the state of
file management out of the user code by means of
a callback for each line.

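It calculates wc for a 1 MB file on a P2.8GHz machine in about 1.7 sec.
The result shows lines, words and characters, then the time in seconds
and the space in bytes.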

   (wc FN) , ts'wc FN'
80000 200000 999999 1.6866 95808


NB. =========================================================
NB. readlines -- line reader

require 'files'

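SB=: 10000                    NB. block size in bytes

readlines=: 1 : 0
  assert fexist y
  S=. fsize y                 NB. file size
  P=. 0                       NB. offset of the next unread byte
  B=. ''                      NB. bytes read but not yet processed
  while. P < S do.
    B=. B,fread y ; P,SR=. SB<.S-P
    P=. P+SR
    if. (#B) >: L=. 1 + B i:LF do.   NB. L: length up to and including the last LF
      u ;.2 L {. B            NB. apply u to each complete line (LF included)
      B=.   L }. B            NB. keep the partial last line for the next block
    end.
  end.
  if. #B do. u B end.         NB. non-terminated last line, if any
)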

NB. =========================================================
NB. user code

lwc=: 3 : 0
  LC=: LC + 1                     NB. lines
  WC=: WC + #@;: }:^:(LF={:)y     NB. words: J tokens, after dropping a trailing LF
  CC=: CC + #y                    NB. characters, LF included
)

wc=: 3 : 0
  LC=: WC=: CC=: 0
  lwc readlines y
  LC , WC , CC
)

ts=: 6!:2 , 7!:2@]

NB. test data: 20000 copies of the 4-line text below
A=: 20000 ((* #) $ ]) 0 : 0
one two three four five

six seven
eight nine ten
)

0 : 0
  (}:A) fwrite FN=: jpath '~temp/t1.txt'
  (wc FN) , ts'wc FN'
)
NB. =========================================================
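
The same callback pattern can be tried without a file. The sketch below
is not part of the script above: eachline and csnline are hypothetical
names, eachline is an in-memory stand-in for a line reader, and the
callback assumes, as in Chris's example, that the key is the first word
of the line.

```j
NB. sketch only: eachline and csnline are hypothetical names, not library verbs

NB. in-memory analogue of a line reader: apply u to every LF-terminated
NB. line of y (LF included), then to a non-terminated last line, if any
eachline=: 1 : 0
  len=. 1 + y i: LF
  if. len > #y do. len=. 0 end.       NB. no LF at all: all of y is one partial line
  if. len do. u ;.2 len {. y end.
  if. len < #y do. u len }. y end.
  i. 0 0
)

CSN=: 0 $ a:                          NB. unique values found so far

NB. callback: remember the value of every line that starts with 'csn '
csnline=: 3 : 0
  line=. }:^:(LF={:) y                NB. drop a trailing LF, if present
  if. 'csn ' -: 4 {. line do. CSN=: ~. CSN , < 4 }. line end.
  0
)

csnline eachline 'abc qweqwe',LF,'csn 1234',LF,'def 123123',LF,'csn 87654'
CSN -: '1234' ; '87654'               NB. 1
```

State is kept in a global, as LC, WC and CC are in wc, so the callback
stays a plain monadic verb and the reader needs to know nothing about it.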


--- Chris Burke wrote:

> Yoel Jacobsen wrote:
> > I wrote some short sentences to parse a log file. I want to retrieve all
> > the unique values of some attribute. The way it shows in the log file is
> > <attribute name>SPACE<attribute value> such as "..... csn 92892849893284 ..."
> > 
> > My initial (brute force) program is:
> > 
> > text =: 1!:1 < '/tmp/logfile'
> > words =: cutopen text
> > bv =: (<'csn') = words
> > srbv =: _1 |.!.0 bv
> > csns =: ~. srbv # words
> > 
> > Now csns holds the unique values as requested.
> > 
> > The program works fine for small files (few megabytes).
> 
> Probably the simplest way to handle this is to read the file in large
> blocks, and chop the blocks into lines. Since lines are of uneven
> length, the blocks will likely not end in a line separator, so need to
> be truncated.
> 
> You don't need to memory map the file.
> 
> The following example assumes each line ends in LF:
> 
> getcsn=: 3 : 0
> siz=. fsize y
> blk=. 1e7
> ptr=. 0
> res=. ''
> while. ptr < siz do.
>   len=. blk <. siz - ptr
>   dat=. fread y;ptr,len
>   lfx=. 1 + dat i: LF
>   ptr=. ptr + lfx
>   dat=. <;._2 lfx {. dat
>   key=. (dat i.&> ' ') {. each dat
>   msk=. key = <'csn'
>   res=. ~. res, msk # dat
> end.
> 4 }. each res
> )
> 
> A=: 0 : 0
> abc qweqwe
> csn 1234
> def 123123
> csn 87654
> )
> 
>    A fwrites F=: jpath '~temp/t1.dat'
> 41
> 
>    getcsn F
> +----+-----+
> |1234|87654|
> +----+-----+
> 
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
> 


