Yoel,

Some of the feedback you got suggested mapped files; other
replies suggested just reading the file. My own habits lean
towards reading the file, and I have a utility verb that gets
"lines" without exceeding a buffer size limit. I find that
buffer sizes > 100Kbytes generally make almost no difference
in processing time - in fact, processing can take longer on
larger chunks. Actually, the gain beyond 40Kbytes is already
minor.
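
A minimal sketch of the read-in-chunks approach described above
(not the actual utility verb; the names countlines and bufsize are
made up for illustration). It assumes the foreigns 1!:4 (file size)
and 1!:11 (indexed read), and it is written with the argument name
y used by J 6 and later; on a J 5 system like the one in the session
below the argument would be spelled y. instead.

```j
bufsize =: 40000                    NB. assumed chunk limit, in bytes

NB. countlines 'filename' : count LF characters, reading at most
NB. bufsize bytes at a time and ending each chunk on a line boundary
countlines =: 3 : 0
fn   =. < y
size =. 1!:4 fn                     NB. file size in bytes
pos  =. 0
n    =. 0
while. pos < size do.
  len   =. bufsize <. size - pos
  chunk =. 1!:11 fn , < pos , len   NB. read len bytes starting at pos
  last  =. chunk i: LF              NB. index of the last LF, or #chunk if none
  if. (last < #chunk) *. (pos + len) < size do.
    chunk =. (>: last) {. chunk     NB. keep whole lines only
  end.
  n   =. n + +/ LF = chunk
  pos =. pos + #chunk               NB. next read starts after the last whole line
end.
n
)
```

The line counting stands in for whatever per-chunk processing is
needed. A line longer than bufsize is simply split across chunks here.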

But in your responses you indicated that you were interested
in avoiding (explicit) loops and doing it in a j style while
still being able to handle large files. j mapped files are
certainly needed in that case. There was also a suggestion of
regex, but my experience calling regex from j has been less
than satisfactory.

In my opinion, these things usually require some thought and
knowledge of the data and the objectives. If the pattern you
are searching for is "nice" (like your keyword 'csn') then
there are usually pretty good ways to have j gather the data.
To find an actual example to illustrate, I catenated the past
8 weeks' worth of sendmail logs on my linux system to create
a file "maillogs" - here is some experimenting with it -

[EMAIL PROTECTED] mqueue]$ wc maillogs
 564175 6987478 75395162 maillogs

   that is, the file is 75Mbytes with 564,175 lines

[EMAIL PROTECTED] mqueue]$ ja  # starting jconsole
   version ''
j504/2005-03-16/15:30
Running in: Linux
   host 'cat /proc/cpuinfo'
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 5
model name      : Pentium II (Deschutes)
stepping        : 2
cpu MHz         : 399.071
cache size      : 512 KB
  ....

NB. not a very fast machine, but it does have 1Gbyte ram available

   require 'jmf'
   JCHAR map_jmf_ 'mls';'maillogs';'';1
NB. HIGHLY recommended to map read only... that is the 1 at the
NB. end of the mapping expression. There is a vicious side effect
NB. (IMHO a BUG) in setting an alias of a mapped name within a verb.

NB. My example is to get the size of messages that passed through
NB. sendmail. Typically there is a phrase like   size=1234,  in
NB. the log. The following is based on that.

   delim =: ','
   tag =: 'size='

   timex 'tagis =: I. tag E. mls'    NB. time and space to get indexes
3.49947 1.34481e8
   timex 'sizes =: delim (_1: ". (] i."1 [) {."0 1 ]) (tagis +/ (#tag)+i. 12){mls'
0.431585 1.37452e7
   $sizes
43947
   +/ x: sizes
11572953524
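
The one-line extraction above can be unpacked step by step. The
following is a sketch on a toy string standing in for the mapped
file; txt, w, win, ends and digs are names introduced only for this
illustration, and the two "log lines" are invented.

```j
NB. toy data (assumption: not real sendmail log lines)
txt   =: 'from=a, size=1234, class=0',LF,'from=b, size=56, class=0',LF
delim =: ','
tag   =: 'size='
w     =: 8                               NB. window width; the session uses 12

tagis =: I. tag E. txt                   NB. start index of every 'size='
win   =: (tagis +/ (#tag) + i. w) { txt  NB. one row of w characters per hit
ends  =: win i."1 delim                  NB. index of the first delim in each row
digs  =: ends {."0 1 win                 NB. characters before delim, space padded
sizes =: _1 ". digs                      NB. convert rows to numbers, _1 on failure
```

Here the two rows of win are '1234, cl' and '56, clas', so sizes
comes out as 1234 56. The window must stay inside the text: a tag
within w characters of the end of the file would give an index error.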

Maybe these are some ideas you can use to attack your problem.
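
As a sketch of how the same idea might carry over to the csn case,
assuming the file is mapped as mls as above, that values are followed
by a space, and that no value is longer than w characters (w, win and
the bound of 24 are assumptions, not anything taken from your log):

```j
tag   =: 'csn '
delim =: ' '
w     =: 24                                     NB. assumed upper bound on value length

tagis =: I. tag E. mls                          NB. start index of every 'csn '
tagis =: tagis #~ tagis <: (#mls) - w + #tag    NB. drop hits too close to the end to window
win   =: (tagis +/ (#tag) + i. w) { mls         NB. w characters after each tag
csns  =: ~. delim ((] i."1 [) {."0 1 ]) win     NB. unique values, one per row
```

csns is then a character matrix with one unique value per row, padded
with spaces. A value that ends a line (followed by LF rather than a
space) or a word that merely ends in 'csn' would need extra care.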

- joey


At 11:01  +0300 2006/05/14, Yoel Jacobsen wrote:
Hello,

I'm new to J so please forgive me if this is a FAQ.

I wrote some short sentences to parse a log file. I want to retrieve all the
unique values of some attribute. The way it shows in the log file is
<attribute name>SPACE<attribute value> such as "..... csn 92892849893284
..."

My initial (brute force) program is:

text =: 1!:1 < '/tmp/logfile'
words =: cutopen text
bv =: (<'csn') = words
srbv =: _1 |.!.0 bv
csns =: ~. srbv # words

Now csns holds the unique values as requested.

The program works fine for small files (few megabytes).

My question is, what should be done to make it work for large files (say,
1GB or more)? I guess it involves memory mapped files but I have no clue
where to continue from here.

Further, is there any notion of 'laziness' (evaluate only when the data is
really needed) in J? Can a verb be declared as a lazy verb?

Thanks,

Yoel
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
