At 18:13  -0400 2007/05/20, Terrence Brannon wrote:
The shootout specs for the sum-file benchmark
http://shootout.alioth.debian.org/debian/benchmark.php?test=sumcol&lang=all
require the use of line-oriented I/O... Raul's solution involved
reading the whole file in.

Line-oriented I/O is important - what happens when a file is larger than memory?

Terrence,

On a lot of modern machines, that is mostly moot. You can also
map files and not "read" them directly. In an interactive session
using the file you pointed to earlier, that might look like this:

   require 'jmf'
   2 map_jmf_ 'data';'sumcol-input.txt'
   $data
4393
   +/ 0&". ;._2 data
500
   (30 {. data) , '...' , _30{. data
276
498
-981
770
-401
702
966
...52
474
-731
758
-573
4
38
264
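For comparison only, the same map-then-sum idea can be sketched in another language. This is a minimal Python analogue (the function name is invented for illustration): memory-map the file, split on whitespace, and sum, with no explicit line-at-a-time loop.

```python
import mmap

def sum_file(path):
    """Memory-map the file and sum the whitespace-separated
    integers in it -- the analogue of  +/ 0&". ;._2 data  above."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return sum(map(int, bytes(m).split()))
```

The OS pages the mapped file in on demand, so "larger than memory" is the kernel's problem, not the program's.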

------

But about line at a time processing -- I have, on occasion, wanted
to process lines in a file.  A utility I wrote to do that is:

   getlines =: 3 : 0
 100000 getlines y                NB. Default buffer size is 100,000 bytes
:
 bs =. x                          NB. buffer size
 fs =. fsize fn =. > 0{y          NB. file size; file name
 fl =. bs <. fs - fp =. > 1{y     NB. bytes to read from current file pointer
 buf =. ir fn;fp,fl               NB. indexed read of fl bytes at offset fp
  if. (fs = fp =. fp + fl) do. fp =. _1 end.  NB. _1 marks EOF
 drop =. (<:#buf)-buf i: NL       NB. bytes after the last newline
  if. ((drop ~: 0) *. fp = _1 ) do. echo '** Unexpected EOF **' end.
 fp =. _1 >. fp - drop            NB. back the pointer up over any partial line
 fn;fp;buf }.~ -drop              NB. result: filename;file_pointer;whole lines
)
NB. The right argument is 3 boxed things:
NB.      'filename' ; file_pointer ; line_buffer
NB. So, to use it on the file above and limit each read to 50
NB. bytes (you can see why this is quite silly!):

   ] work =: 50 getlines 'sumcol-input.txt';0;''
+----------------+--+------------------------------------------------+
|sumcol-input.txt|48|276 498 -981 770 -401 702 966 950 -853 -53 -293 |
+----------------+--+------------------------------------------------+
   ] work =: 50 getlines work
+----------------+--+-------------------------------------------------+
|sumcol-input.txt|97|604 288 892 -697 204 96 408 880 -7 -817 422 -261 |
+----------------+--+-------------------------------------------------+
   ] work =: 50 getlines work
+----------------+---+-------------------------------------------------+
|sumcol-input.txt|146|-485 -77 826 184 864 -751 626 812 -369 -353 -371 |
+----------------+---+-------------------------------------------------+

NB. Eventually, we get to the following results - and you
NB. can see that in an iteration we would use 2{work for
NB. our calculation and check the file pointer to see
NB. whether EOF (indicated by _1) had been reached yet ...

   ] work =: 50 getlines work
+----------------+----+-----------------------------------------------+
|sumcol-input.txt|4362|338 248 494 130 404 358 600 -639 -125 -33 -965 |
+----------------+----+-----------------------------------------------+
   ] work =: 50 getlines work
+----------------+--+-------------------------------+
|sumcol-input.txt|_1|752 474 -731 758 -573 4 38 264 |
+----------------+--+-------------------------------+

NB. Clearly this isn't a "j way" to do things. The default
NB. buffer size of 100000 is something of a reality check:
NB. a buffer that size will work through a very large file
NB. about as fast as 1,000,000-byte chunks, but there is a
NB. big overhead if you use very small buffers (or read one
NB. line at a time).
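The buffering logic of the getlines verb carries over to other languages directly. Here is a sketch of the same idea in Python (names chosen to mirror the J verb; not part of the original post): read a block, note EOF with a pointer of -1, and back up over any trailing partial line so the caller only ever sees whole lines.

```python
def getlines(name, pointer=0, bufsize=100_000):
    """Read up to bufsize bytes starting at pointer, returning
    (name, pointer, buf) where buf holds only whole lines and
    pointer is -1 once EOF has been consumed."""
    with open(name, "rb") as f:
        f.seek(0, 2)
        size = f.tell()                    # file size, like fsize
        f.seek(pointer)
        buf = f.read(min(bufsize, size - pointer))
    pointer += len(buf)
    if pointer == size:
        pointer = -1                       # -1 marks EOF, as in the J verb
    drop = len(buf) - 1 - buf.rfind(b"\n") # bytes after the last newline
    if drop and pointer == -1:
        print("** Unexpected EOF **")
    pointer = max(-1, pointer - drop)      # back up over the partial line
    if drop:
        buf = buf[:-drop]
    return name, pointer, buf
```

A caller loops, feeding the returned pointer back in, until the pointer goes to -1.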


----

In a more practical sense, many large files are databases. Many
databases are really flat files with fixed-length fields, and such
a file can be viewed as a j rectangular array via mapping. That is,
you can map the file directly, specify the line length (which does
not require that the file have ASCII line-end characters), and then
define fields and operate on them.

Here is an example that I presented at the 2000 j user conference -

iMg5:~/Documents/jstuff/mci jkt$ ja mci.ijs
Mapped name of 113997.001 is  cdf

   $cdf    NB. this gives the shape of a mapped phone bill database
564218 404

NB. The following is a function that summarizes the 564218 calls

   mci_summary
3 : 0
((5 6 0{BI, Call_date), To_pid) mci_summary y
:
   key =.  x {"1 y
ot =. 1 60 10000%"1~ (#/.~key),.key+//. 0". (Bill_duration, BI, Cost){"1 y
/:~(~.key),.' ',.10 10j1 10j2 10j2 ": ot,. 100*%/"1 ]_1 _2{"1 ot
)

NB. Here are the global field (column) definitions -

   BI          NB. a column/field known to be all blanks
20

   Call_date   NB. The columns containing the date of calls
120 121 122 123 124 125 126 127

   To_pid      NB. A 3 character name for the Product ID
211 212 213

   Bill_duration  NB. Call duration in seconds
163 164 165 166 167 168

   Cost           NB. Cost of the call in 1e_4 cent units
218 219 220 221 222 223 224 225

   mci_summary cdf
08 INA          1       0.2      0.01      3.05
08 INE         97      75.5      2.39      3.16
09 ALA       3613    2881.0    265.05      9.20
09 CAN       8715   10836.1    775.84      7.16
09 CAR        126     111.6     30.08     26.95
09 EDL         19      25.4      3.61     14.22
09 HAW       1107    1462.6    114.62      7.84
09 INA      20959   20996.7    625.97      2.98
09 INE     527143  539187.3  16501.82      3.06
09 INT       1777    2103.0    463.28     22.03
09 MEX        139     142.9     27.15     19.00
09 PUE        472     650.6     58.60      9.01
09 VIR         50      51.5      4.64      9.01

NB. The summary report columns are:
NB. 0 - month billed (notice almost all calls were in September)
NB. 1 - product/destination code (ALA - Alaska, CAN - Canada, etc.)
NB. 2 - number of calls
NB. 3 - total minutes aggregated by date/product
NB. 4 - cost of calls aggregated by date/product
NB. 5 - average cost (cents/minute) for calls by date/product
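The fixed-width approach above translates readily to other languages. This Python sketch shows the shape of it with an invented record layout (the field positions here are illustrative, not the phone-bill columns above): map the file, slice each fixed-length record, and aggregate by key.

```python
import mmap
from collections import defaultdict

# Hypothetical layout: each 20-byte record holds a 3-byte product
# code, a 6-byte duration in seconds, an 8-byte cost, then padding.
RECLEN = 20
CODE, DUR, COST = slice(0, 3), slice(3, 9), slice(9, 17)

def summarize(path):
    """Aggregate calls, seconds, and cost per product code over a
    memory-mapped fixed-width file."""
    totals = defaultdict(lambda: [0, 0, 0])   # code -> [calls, secs, cost]
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for off in range(0, len(m), RECLEN):
            rec = m[off:off + RECLEN]
            t = totals[rec[CODE].decode()]
            t[0] += 1
            t[1] += int(rec[DUR])
            t[2] += int(rec[COST])
    return dict(totals)
```

As in the J version, no line-end characters are needed; the record length alone gives the file its rectangular shape.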

   timex =: 6!:2 , 7!:2@]

   timex 'mci_summary cdf'
1.4244 8.70376e7

The above shows that the report was generated in 1.4 seconds. That
compares very favorably with just counting the lines in the file:

iMg5:~/Documents/jstuff/mci jkt$ time wc -cl 113997.001
  564218 227944072 113997.001

real    0m1.557s
user    0m0.852s
sys     0m0.363s
iMg5:~/Documents/jstuff/mci jkt$

In fact, the aggregation and report generation for the data is
quicker than the OS utility to count lines - I like this. You
can see that j used 87 megabytes (not a big load on my 1.5G
iMac) to process the 227,944,072 byte file.

The thing is that benchmarks which read a line at a time, as in
the example that started your questions, are just "not done that
way in j". The real advantage is terse programs that subsume
detail much more than most programming languages do.

- joey

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
