At 18:13 -0400 2007/05/20, Terrence Brannon wrote:
> The shootout specs for the sum-file benchmark
> http://shootout.alioth.debian.org/debian/benchmark.php?test=sumcol&lang=all
> require the use of line-oriented I/O... Raul's solution involved
> reading the whole file in.
> Line-oriented I/O is important - what happens when a file is larger
> than memory?
Terrence,
On a lot of modern machines, that is mostly moot. You can also
map files and not "read" them directly. In an interactive session
using the file you pointed to earlier, that might look like this:
   require 'jmf'
   2 map_jmf_ 'data';'sumcol-input.txt'
   $data
4393
   +/ 0&". ;._2 data
500
   (30 {. data) , '...' , _30{. data
276
498
-981
770
-401
702
966
...52
474
-731
758
-573
4
38
264
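The same map-it-don't-read-it idea carries over outside J as well. Here is a Python sketch of the technique (not the J method itself, and the data file below is a tiny stand-in created on the spot): memory-map the file so the OS pages it in on demand, then sum the lines.

```python
import mmap
import os

# Stand-in data file (hypothetical contents; any one-integer-per-line file works).
with open('sumcol-input.txt', 'wb') as f:
    f.write(b'276\n498\n-981\n770\n')

# Map the file instead of read()ing it: the OS pages it in on demand,
# so the same code handles files larger than physical memory.
total = 0
with open('sumcol-input.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for line in iter(m.readline, b''):
            total += int(line)

print(total)   # 276+498-981+770 = 563
os.remove('sumcol-input.txt')
```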
------
But about line-at-a-time processing -- I have, on occasion, wanted
to process lines in a file. A utility I wrote to do that is:
getlines =: 3 : 0
100000 getlines y NB. Default BS is 100,000 bytes
:
bs =. x
fs =. fsize fn =. > 0{y
fl =. bs <. fs -fp =. > 1{y
buf =. ir fn;fp,fl
if. (fs = fp =. fp + fl) do. fp =. _1 end.
drop =. (<:#buf)-buf i: NL
if. ((drop ~: 0) *. fp = _1 ) do. echo '** Unexpected EOF **' end.
fp =. _1 >. fp - drop
fn;fp;buf }.~ -drop
)
NB. The right argument is 3 boxed things
NB. 'filename' ; file_pointer ; line_buffer
NB. So, to use it on the file above and limit the input to 50
NB. bytes (you can see why this is quite silly!)
] work =: 50 getlines 'sumcol-input.txt';0;''
+----------------+--+------------------------------------------------+
|sumcol-input.txt|48|276 498 -981 770 -401 702 966 950 -853 -53 -293 |
+----------------+--+------------------------------------------------+
] work =: 50 getlines work
+----------------+--+-------------------------------------------------+
|sumcol-input.txt|97|604 288 892 -697 204 96 408 880 -7 -817 422 -261 |
+----------------+--+-------------------------------------------------+
] work =: 50 getlines work
+----------------+---+-------------------------------------------------+
|sumcol-input.txt|146|-485 -77 826 184 864 -751 626 812 -369 -353 -371 |
+----------------+---+-------------------------------------------------+
NB. Eventually, we get to the following results - and you
NB. can see that in an iteration we would use 2{work for
NB. our calculation and check the file pointer to see
NB. if EOF (indicated by _1) had been reached yet ...
] work =: 50 getlines work
+----------------+----+-----------------------------------------------+
|sumcol-input.txt|4362|338 248 494 130 404 358 600 -639 -125 -33 -965 |
+----------------+----+-----------------------------------------------+
] work =: 50 getlines work
+----------------+--+-------------------------------+
|sumcol-input.txt|_1|752 474 -731 758 -573 4 38 264 |
+----------------+--+-------------------------------+
NB. Clearly this isn't a "j way" to do things. The default
NB. buffer size of 100000 is something of a reality check.
NB. A buffer that size will work through a very large file
NB. about as fast as 1,000,000-byte chunks, but there is
NB. big overhead with very small buffers (or reading one
NB. line at a time).
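For comparison, the buffered approach can be sketched in Python. The function below mirrors the getlines verb's contract (file name and file pointer in; whole lines and an updated pointer out, with -1 playing the role of J's _1 at EOF). The demo file and the deliberately tiny 9-byte buffer are made-up stand-ins.

```python
import os

def getlines(fn, fp, bs=100000):
    """Sketch of the getlines verb: read up to bs bytes starting at fp,
    keep only whole lines, and return (new_fp, buffer). fp == -1 means EOF."""
    fs = os.path.getsize(fn)
    fl = min(bs, fs - fp)
    with open(fn, 'rb') as f:
        f.seek(fp)
        buf = f.read(fl)
    fp += fl
    if fp == fs:
        fp = -1
    # Drop any partial line after the last newline and back the pointer up,
    # just as the J version does with  buf }.~ -drop .
    drop = len(buf) - 1 - buf.rfind(b'\n')
    if drop and fp == -1:
        print('** Unexpected EOF **')
    fp = max(-1, fp - drop)
    return fp, buf[:len(buf) - drop]

# Demo on a tiny stand-in file with a deliberately small buffer.
with open('demo.txt', 'wb') as f:
    f.write(b'276\n498\n-981\n770\n')

total, fp = 0, 0
while fp != -1:
    fp, buf = getlines('demo.txt', fp, bs=9)
    total += sum(int(x) for x in buf.split(b'\n') if x)

print(total)   # 563
os.remove('demo.txt')
```

As in the J version, a line longer than the buffer would stall the loop, which is another reason very small buffers are a bad idea.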
----
In a more practical sense, many large files are databases. Many
databases are really flat files that have fixed length fields and
can be viewed as a j rectangular array via mapping. That is, you
can map the file directly and give the line length (which does not
require that the file have ASCII line end characters) and then
define fields and operate with them.
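A Python sketch of that fixed-length-record view (the record length, field positions, and data below are all invented for illustration): derive the record count from the file size, then slice fields out of each record, with no line-end characters involved at all.

```python
import os

RECLEN = 16                 # hypothetical fixed record length (no newlines needed)
AMOUNT = slice(10, 16)      # hypothetical field: bytes 10..15 of each record

# Stand-in "database": three fixed-length records in one flat file.
with open('db.dat', 'wb') as f:
    f.write(b'ALICE     000120'
            b'BOB       000045'
            b'CAROL     000300')

nrec = os.path.getsize('db.dat') // RECLEN   # the file's "shape" is nrec x RECLEN
with open('db.dat', 'rb') as f:
    data = f.read()

# Slice the AMOUNT field out of each record and total it.
total = sum(int(data[i * RECLEN:(i + 1) * RECLEN][AMOUNT]) for i in range(nrec))
print(nrec, total)   # 3 465
os.remove('db.dat')
```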
Here is an example that I presented at the 2000 j user conference -
iMg5:~/Documents/jstuff/mci jkt$ ja mci.ijs
Mapped name of 113997.001 is cdf
$cdf NB. this gives the shape of a mapped phone bill database
564218 404
NB. The following is a function that summarizes the 564218 calls
mci_summary
3 : 0
((5 6 0{BI, Call_date), To_pid) mci_summary y
:
key =. x {"1 y
ot =. 1 60 10000%"1~ (#/.~key),.key+//. 0". (Bill_duration, BI, Cost){"1 y
/:~(~.key),.' ',.10 10j1 10j2 10j2 ": ot,. 100*%/"1 ]_1 _2{"1 ot
)
NB. Here are the global field (column) definitions -
BI NB. a column/field known to be all blanks
20
Call_date NB. The columns containing the date of calls
120 121 122 123 124 125 126 127
To_pid NB. A 3 character name for the Product ID
211 212 213
Bill_duration NB. Call duration in seconds
163 164 165 166 167 168
Cost NB. Cost of the call in 1e_4 cent units
218 219 220 221 222 223 224 225
mci_summary cdf
08 INA 1 0.2 0.01 3.05
08 INE 97 75.5 2.39 3.16
09 ALA 3613 2881.0 265.05 9.20
09 CAN 8715 10836.1 775.84 7.16
09 CAR 126 111.6 30.08 26.95
09 EDL 19 25.4 3.61 14.22
09 HAW 1107 1462.6 114.62 7.84
09 INA 20959 20996.7 625.97 2.98
09 INE 527143 539187.3 16501.82 3.06
09 INT 1777 2103.0 463.28 22.03
09 MEX 139 142.9 27.15 19.00
09 PUE 472 650.6 58.60 9.01
09 VIR 50 51.5 4.64 9.01
NB. The summary report columns are
NB. 0 - month billed (notice almost all calls were in September)
NB. 1 - product/destination code (ALA = Alaska, CAN = Canada, etc.)
NB. 2 - number of calls
NB. 3 - total minutes aggregated by date/product
NB. 4 - cost of calls aggregated by date/product
NB. 5 - average cost (cents/minute) for calls by date/product
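The heart of mci_summary is J's key adverb (/.), which tallies rows and sums values grouped by a composite key. A Python sketch of that aggregation pattern, with made-up stand-in records:

```python
from collections import defaultdict

# Stand-in call records: (product_code, duration_seconds, cost_units).
rows = [('INE', 60, 200), ('CAN', 120, 500), ('INE', 30, 100)]

counts = defaultdict(int)             # like  #/.~ key       (tally per key)
totals = defaultdict(lambda: [0, 0])  # like  key +//. vals  (sums per key)
for code, dur, cost in rows:
    counts[code] += 1
    totals[code][0] += dur
    totals[code][1] += cost

for code in sorted(totals):           # like /:~ (sort the report rows)
    dur, cost = totals[code]
    # scale seconds to minutes and cost units to dollars, as 1 60 10000 %"1~ does
    print(code, counts[code], dur / 60, cost / 10000)
```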
timex =: 6!:2 , 7!:2@]
timex 'mci_summary cdf'
1.4244 8.70376e7
The above shows that the report was generated in 1.4 seconds. This
compares very favorably with just counting the lines in the file:
iMg5:~/Documents/jstuff/mci jkt$ time wc -cl 113997.001
564218 227944072 113997.001
real 0m1.557s
user 0m0.852s
sys 0m0.363s
iMg5:~/Documents/jstuff/mci jkt$
In fact, the aggregation and report generation for the data is
quicker than the OS utility to count lines - I like this. You
can see that j used 87 megabytes (not a big load on my 1.5G
iMac) to process the 227,944,072 byte file.
The thing is that benchmarks which read a line at a time, as in
the example that started your questions, are just "not done that
way in j". The real advantage is terse programs that subsume
detail much more than most programming languages do.
- joey
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm