Sorry, I forgot to paste in the runtime for the APL version. It's
approximately *6:30*. That's only 7 times slower than the Lisp version
doing the same thing which is not fantastic, but acceptable I suppose.
Here's the source to the Lisp version. As you can see, it tries really hard
to replicate the behaviour of the APL version, with all that copying.
(defparameter *result*
(time
(with-open-file (s "apjs492452t1_mrt.txt")
(let ((result (make-array '(0 11))))
(loop
for line = (read-line s nil)
while line
do (let ((parts (split-sequence:split-sequence #\Space
line :remove-empty-subseqs t))
(new-array (make-array (list (1+
(array-dimension result 0)) 11)))
(n (reduce #'* (array-dimensions result))))
(loop for i from 0 below n
do (setf (row-major-aref new-array i)
(row-major-aref result i)))
(loop for i from n repeat 10
for p in parts
do (setf (row-major-aref new-array i)
(parse-number:parse-number p)))
(setf (row-major-aref new-array (+ n 10)) (nth 10
parts))
(setq result new-array)))))))
On 18 January 2017 at 03:58, Elias Mårtenson <[email protected]> wrote:
> You are right that doing array allocations in any language is slow, and in
> order to get some kind of baseline to compare things to, I implemented the
> same algorithm in Lisp, and I did so in the worst possible way: By
> completely reallocating the array for each row, and copying the old content
> into the new one (code in the end of this message). I did it that way
> because that is what the APL code ends up doing, so I didn't want to give
> the Lisp program any benefits.This was, as you may imagine, very slow. It
> took *51* seconds to load the entire file.
>
> The APL version that I posted in my previous message takes
>
> Now, for a solution. Reading the file twice is quite ugly, and I'd rather
> avoid doing so.
>
> I could rewrite the Lisp program to in-place reallocation of the array,
> which would speed up the program by several orders of magnitude. The
> problem is that there is no mechanism in GNU APL to do the same. I think
> adding a new construct to the language to do that would be ugly, but
> perhaps there is a way for GNU APL to detect the specific “append to the
> end of an array” idiom so that it could use an optimised path that doesn't
> require constructing a new array. *Jürgen*, is that even feasible?
>
> Now, reading a CSV file like this is a very common case, and I am of the
> opinion that having a dedicated native function for this purpose would
> solve perhaps 75% of use cases where this is an issue. I'll be happy to
> implement a flexible function that can do this, which would make GNU APL
> much more pleasant to use for data analysis.
>
> Jürgen, if I were to implement this, would you prefer it to be a
> dynamically loaded library, or would you prefer having it as a
> quad-function?
>
> Regards,
> Elias
>
>
> On 18 January 2017 at 02:33, Nick Lobachevsky <[email protected]> wrote:
>
>> Elias,
>>
>> In just about any language, iterative string catenation is a real
>> performance killer owing to the repeated growing memory allocation
>> which just gets bigger with every iteration.
>>
>> I would restructure the algorithm to work more like this: (sorry, on
>> a Windoze box and no APL, so metacode only)
>>
>> Open the file, read the contents of the entire file, close the file
>> Change NL to separator, then partition the whole thing (Z)
>> Check to see that the length of the partitioned string divides evenly
>> by pattern length
>> Create a temporary of only the numeric items. As they should all be
>> scalar, you should be able to do it in a single execute and get a
>> numeric vector of the correct length.
>> Assign the numeric vector back into Z. You may want to do a ravel
>> each if you would rather have one element vectors
>> Reshape Z to have the same number columns as in pattern
>>
>> Dyalog and some other APLs (APLX?) have a []CSV function. The problem
>> gets a bit harder when you have optional quotes around strings,
>> embedded separators, and have to figure out yourself whether something
>> is numeric or char.
>>
>>
>>
>> On 1/17/17, Elias Mårtenson <[email protected]> wrote:
>> > I wanted to use GNU APL to work on a dataset of star data. The file
>> > consists of 34030 lines of the following form:
>> >
>> > 892376 3813 4.47 0.4699 1.532 0.007 7306.69 0.823 0.4503 0 ---
>> > 1026146 4261 4.57 0.6472 14.891 0.12 11742.56 1.405 0.7229 0 ---
>> > 1026474 4122 4.56 0.5914 1.569 0.006 30471.8 1.204 0.6061 0 ---
>> > 1162635 3760 4.77 0.4497 15.678 0.019 10207.47 0.978 0.5445 1 ---
>> >
>> > I wrote a generic CSV loader to handle this (source code at the end of
>> this
>> > email), and loaded the data like so:
>> >
>> > * z ← 'nnnnnnnnnns' read_csv 'apjs492452t1_mrt.txt'*
>> >
>> > This took many minutes to load, which in my opinion shouldn't happen.
>> >
>> > Now, I have a few questions:
>> >
>> >
>> > 1. Is there a way to speed up this code?
>> > 2. Is there something that could be done on the GNU APL
>> implementation
>> > side to make this faster?
>> > 3. Shouldn't we have a generic ⎕CSV function or something like that
>> > which would be able to load CSV files in milliseconds regardless of
>> > size?
>> > This should be trivial to do in C++.
>> >
>> > Here's the code in question:
>> >
>> > ∇Z ← type convert_entry value
>> > →('n'≡type)/numeric
>> > →('s'≡type)/string
>> > ⎕ES 'Illegal conversion type'
>> > numeric:
>> > Z←⍎value
>> > →end
>> > string:
>> > Z←value
>> > end:
>> > ∇
>> >
>> > ∇Z ← pattern read_csv filename ;fd;line;separator
>> > separator ← ' '
>> > Z ← 0 (↑⍴pattern) ⍴ ⍬
>> > fd ← 'r' FIO∆fopen filename
>> >
>> > next:
>> > line ← FIO∆fgets fd ⍝ Read one line from the file
>> > →(⍬≡line)/end
>> > →(10≠line[⍴line])/skip_nl ⍝ If the line ends in a newline
>> > line ← line[⍳¯1+⍴line] ⍝ Remove the newline
>> > skip_nl:
>> > line ← ⎕UCS line
>> > Z ← Z⍪ pattern convert_entry¨ (line≠separator) ⊂ line
>> > →next
>> > end:
>> >
>> > FIO∆fclose fd
>> > ∇
>> >
>>
>
>