RE: [Jprogramming] Scanning a large file

Oleg Kobchenko Mon, 15 May 2006 07:50:54 -0700

I believe you cannot map the entire file at once;
the limit is only 2GB:
   'c0'8!:2]_1+2^31
2,147,483,647
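
One way around the limit is to avoid holding the whole file in a single noun and to scan it in pieces with indexed read (1!:11). Below is a minimal sketch, assuming current J syntax (x and y as argument names; J504 spells them x. and y.). The names scanchunks and CHUNK are made up for illustration, and the chunk size is an arbitrary guess. Each read overlaps the next chunk by one less than the pattern length, so a match that straddles a boundary is still reported exactly once. Whether a 32-bit J can address offsets past 2GB this way is something I have not checked.

```j
NB. Hypothetical helper, not from this thread.
NB. x is the string to find, y is a file name.
NB. Result: file offsets at which x occurs.
CHUNK =: 1000000                 NB. bytes per read (assumed value, tune as needed)

scanchunks =: 4 : 0
  size  =. 1!:4 < y              NB. file length in bytes
  over  =. <: # x                NB. overlap needed to catch boundary matches
  hits  =. i. 0
  start =. 0
  while. start < size do.
    len   =. (CHUNK + over) <. size - start
    chunk =. 1!:11 y ; start , len        NB. indexed read: name ; index,length
    hits  =. hits , start + I. x E. chunk
    start =. start + CHUNK
  end.
  hits
)
```

With the file from the example below, 'size=' scanchunks 'maillogs' would give the same offsets as I. tag E. mls, without mapping anything.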



--- Henry Rich <[EMAIL PROTECTED]> wrote:

> Try
> 
> x ([: I. E.) y
> 
> to get the list of places where the string x occurs.  This uses
> special code and doesn't create the entire result of E. .
> 
> Henry Rich
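
To make the suggestion above concrete, here is a small sketch on a toy string (the data is made up for illustration). It shows that the train returns the start index of each occurrence, the same indices one would get by building the full boolean result of E. and then applying I. to it.

```j
NB. toy data, made up for illustration
text =: 'a size=12, b size=7, c'
hits =: 'size=' ([: I. E.) text     NB. start index of each occurrence
NB. hits is 2 13
mask =: 'size=' E. text             NB. full boolean mask, one item per character
NB. I. mask gives the same indices as hits
```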
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED] 
> > [mailto:[EMAIL PROTECTED] On Behalf Of Yoel Jacobsen
> > Sent: Monday, May 15, 2006 10:09 AM
> > To: Programming forum
> > Subject: Re: [Jprogramming] Scanning a large file
> > 
> > It won't work for large files. E. returns a 'limit error'.
> > 
> > Yoel
> > 
> > On 5/14/06, Joey K Tuttle <[EMAIL PROTECTED]> wrote:
> > >
> > > Yoel,
> > >
> > > Some of the feedback you got suggested mapped files, others
> > > suggested just reading the file. My own habits lean towards
> > > reading the file and I have a utility verb that gets "lines"
> > > while not exceeding a buffer size limit. I find that buffer
> > > sizes > 100Kbytes generally make almost no difference in
> > > processing time - in fact, processing can take longer on
> > > larger chunks. Actually, the gain after 40Kbytes is minor
> > > indeed.
> > >
> > > But in your responses you indicated that you were interested
> > > in not using (explicit) loops and doing it in a j style yet
> > > being able to handle large files. j mapped files are certainly
> > > needed in that case. There was also a suggestion of regex,
> > > but my experience calling regex from j has been less than
> > > satisfactory.
> > >
> > > In my opinion, these things usually require some thought and
> > > knowledge of the data and the objectives. If the pattern you
> > > are searching for is "nice" (like your keyword 'csn') then
> > > there are usually pretty good ways to have j gather the data.
> > > To find an actual example to illustrate, I catenated the past
> > > 8 weeks worth of sendmail logs on my linux system to create
> > > a file "maillogs" - here is some experimenting with it -
> > >
> > > [EMAIL PROTECTED] mqueue]$ wc maillogs
> > >   564175 6987478 75395162 maillogs
> > >
> > >     that is, the file is 75Mbytes with 564,175 lines
> > >
> > > [EMAIL PROTECTED] mqueue]$ ja  # starting jconsole
> > >     version ''
> > > j504/2005-03-16/15:30
> > > Running in: Linux
> > >     host 'cat /proc/cpuinfo'
> > > processor       : 0
> > > vendor_id       : GenuineIntel
> > > cpu family      : 6
> > > model           : 5
> > > model name      : Pentium II (Deschutes)
> > > stepping        : 2
> > > cpu MHz         : 399.071
> > > cache size      : 512 KB
> > >    ....
> > >
> > > NB. not a very fast machine, but it does have 1Gbyte ram available
> > >
> > >     require 'jmf'
> > >     JCHAR map_jmf_ 'mls';'maillogs';'';1
> > > NB. HIGHLY recommended to map read only... that is the 1 at the
> > > NB. end of the mapping expression. There is a vicious side effect
> > > NB. (IMHO a BUG) in setting an alias of a mapped name within a verb.
> > >
> > > NB. My example is to get the size of messages that passed through
> > > NB. sendmail. Typically there is a phrase like   size=1234,  in
> > > NB. the log. The following is based on that.
> > >
> > >     delim =: ','
> > >     tag =: 'size='
> > >
> > >     timex 'tagis =: I. tag E. mls'    NB. time and space to get indexes
> > > 3.49947 1.34481e8
> > >     timex 'sizes =: delim (_1: ". (] i."1 [) {."0 1 ]) (tagis +/ (#tag)+i. 12){mls'
> > > 0.431585 1.37452e7
> > >     $sizes
> > > 43947
> > >     +/ x: sizes
> > > 11572953524
> > >
> > > Maybe these are some ideas you can use to attack your problem.
> > >
> > > - joey
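
The sizes expression above is dense, so here is the same idea unrolled into named steps on a toy string. This is only a sketch: the data is made up, and the names win, cut and digs are mine, not Joey's. The toy text is padded with blanks because indexing a 12-character window past the end of the noun would give an index error.

```j
NB. toy data, made up; padded so the 12-character windows stay in range
mls   =: 'a size=12, b size=7, c' , 12 # ' '
delim =: ','
tag   =: 'size='

tagis =: I. tag E. mls                     NB. 2 13: where each tag starts
win   =: (tagis +/ (#tag) + i. 12) { mls   NB. 12 characters after each tag, one row per hit
cut   =: win i."1 delim                    NB. 2 1: position of the delimiter in each row
digs  =: cut {."0 1 win                    NB. keep only the characters before the delimiter
sizes =: , _1 ". digs                      NB. 12 7: convert, with _1 for anything non-numeric
```

The window width of 12 is simply an upper bound on how many digits a size can have, plus the delimiter.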
> > >
> > >
> > > At 11:01  +0300 2006/05/14, Yoel Jacobsen wrote:
> > > >Hello,
> > > >
> > > >I'm new to J so please forgive me if this is a FAQ.
> > > >
> > > >I wrote some short sentences to parse a log file. I want to retrieve all
> > > >the unique values of some attribute. The way it shows in the log file is
> > > ><attribute name>SPACE<attribute value> such as "..... csn 92892849893284
> > > >..."
> > > >
> > > >My initial (brute force) program is:
> > > >
> > > >text =: 1!:1 < '/tmp/logfile'
> > > >words =: cutopen text
> > > >bv =: (<'csn') = words
> > > >srbv =: _1 |.!.0 bv
> > > >csns =: ~. srbv # words
> > > >
> > > >Now csns holds the unique values as requested.
> > > >
> > > >The program works fine for small files (few megabytes).
> > > >
> > > >My question is, what should be done to make it work for large files (say,
> > > >1GB or more)? I guess it involves memory mapped files but I have no clue
> > > >where to continue from here.
> > > >
> > > >Further, is there any notion of 'laziness' (evaluate only when the data is
> > > >really needed) in J? Can a verb be declared as a lazy verb?
> > > >
> > > >Thanks,
> > > >
> > > >Yoel


