Re: [Bug-apl] Performance problems when constructing large(ish) arrays

Elias Mårtenson Tue, 17 Jan 2017 12:03:50 -0800

Sorry, I forgot to paste in the runtime for the APL version. It's
approximately *6:30*. That's only 7 times slower than the Lisp version
doing the same thing which is not fantastic, but acceptable I suppose.


Here's the source to the Lisp version. As you can see, it tries really hard
to replicate the behaviour of the APL version, with all that copying.

(defparameter *result*
           (time
            (with-open-file (s "apjs492452t1_mrt.txt")
              (let ((result (make-array '(0 11))))
                (loop
                  for line = (read-line s nil)
                  while line
                  do (let ((parts (split-sequence:split-sequence #\Space
line :remove-empty-subseqs t))
                           (new-array (make-array (list (1+
(array-dimension result 0)) 11)))
                           (n (reduce #'* (array-dimensions result))))
                       (loop for i from 0 below n
                             do (setf (row-major-aref new-array i)
(row-major-aref result i)))
                       (loop for i from n repeat 10
                             for p in parts
                             do (setf (row-major-aref new-array i)
(parse-number:parse-number p)))
                       (setf (row-major-aref new-array (+ n 10)) (nth 10
parts))
                       (setq result new-array)))))))

On 18 January 2017 at 03:58, Elias Mårtenson <[email protected]> wrote:

> You are right that doing array allocations in any language is slow, and in
> order to get some kind of baseline to compare things to, I implemented the
> same algorithm in Lisp, and I did so in the worst possible way: By
> completely reallocating the array for each row, and copying the old content
> into the new one (code in the end of this message). I did it that way
> because that is what the APL code ends up doing, so I didn't want to give
> the Lisp program any benefits.This was, as you may imagine, very slow. It
> took *51* seconds to load the entire file.
>
> The APL version that I posted in my previous message takes
>
> Now, for a solution. Reading the file twice is quite ugly, and I'd rather
> avoid doing so.
>
> I could rewrite the Lisp program to in-place reallocation of the array,
> which would speed up the program by several orders of magnitude. The
> problem is that there is no mechanism in GNU APL to do the same. I think
> adding a new construct to the language to do that would be ugly, but
> perhaps there is a way for GNU APL to detect the specific “append to the
> end of an array” idiom so that it could use an optimised path that doesn't
> require constructing a new array. *Jürgen*, is that even feasible?
>
> Now, reading a CSV file like this is a very common case, and I am of the
> opinion that having a dedicated native function for this purpose would
> solve perhaps 75% of use cases where this is an issue. I'll be happy to
> implement a flexible function that can do this, which would make GNU APL
> much more pleasant to use for data analysis.
>
> Jürgen, if I were to implement this, would you prefer it to be a
> dynamically loaded library, or would you prefer having it as a
> quad-function?
>
> Regards,
> Elias
>
>
> On 18 January 2017 at 02:33, Nick Lobachevsky <[email protected]> wrote:
>
>> Elias,
>>
>> In just about any language, iterative string catenation is a real
>> performance killer owing to the repeated growing memory allocation
>> which just gets bigger with every iteration.
>>
>> I would restructure the algorithm to work more like this:  (sorry, on
>> a Windoze box and no APL, so metacode only)
>>
>> Open the file, read the contents of the entire file, close the file
>> Change NL to separator, then partition the whole thing (Z)
>> Check to see that the length of the partitioned string divides evenly
>> by pattern length
>> Create a temporary of only the numeric items.  As they should all be
>> scalar, you should be able to do it in a single execute and get a
>> numeric vector of the correct length.
>> Assign the numeric vector back into Z.  You may want to do a ravel
>> each if you would rather have one element vectors
>> Reshape Z to have the same number columns as in pattern
>>
>> Dyalog and some other APLs (APLX?) have a []CSV function.  The problem
>> gets a bit harder when you have optional quotes around strings,
>> embedded separators, and have to figure out yourself whether something
>> is numeric or char.
>>
>>
>>
>> On 1/17/17, Elias Mårtenson <[email protected]> wrote:
>> > I wanted to use GNU APL to work on a dataset of star data. The file
>> > consists of 34030 lines of the following form:
>> >
>> >   892376 3813 4.47 0.4699  1.532  0.007    7306.69 0.823 0.4503 0 ---
>> >  1026146 4261 4.57 0.6472 14.891  0.12    11742.56 1.405 0.7229 0 ---
>> >  1026474 4122 4.56 0.5914  1.569  0.006   30471.8  1.204 0.6061 0 ---
>> >  1162635 3760 4.77 0.4497 15.678  0.019   10207.47 0.978 0.5445 1 ---
>> >
>> > I wrote a generic CSV loader to handle this (source code at the end of
>> this
>> > email), and loaded the data like so:
>> >
>> > *    z ← 'nnnnnnnnnns' read_csv 'apjs492452t1_mrt.txt'*
>> >
>> > This took many minutes to load, which in my opinion shouldn't happen.
>> >
>> > Now, I have a few questions:
>> >
>> >
>> >    1. Is there a way to speed up this code?
>> >    2. Is there something that could be done on the GNU APL
>> implementation
>> >    side to make this faster?
>> >    3. Shouldn't we have a generic ⎕CSV function or something like that
>> >    which would be able to load CSV files in milliseconds regardless of
>> > size?
>> >    This should be trivial to do in C++.
>> >
>> > Here's the code in question:
>> >
>> > ∇Z ← type convert_entry value
>> >   →('n'≡type)/numeric
>> >   →('s'≡type)/string
>> >   ⎕ES 'Illegal conversion type'
>> > numeric:
>> >   Z←⍎value
>> >   →end
>> > string:
>> >   Z←value
>> > end:
>> > ∇
>> >
>> > ∇Z ← pattern read_csv filename ;fd;line;separator
>> >   separator ← ' '
>> >   Z ← 0 (↑⍴pattern) ⍴ ⍬
>> >   fd ← 'r' FIO∆fopen filename
>> >
>> > next:
>> >   line ← FIO∆fgets fd           ⍝ Read one line from the file
>> >   →(⍬≡line)/end
>> >   →(10≠line[⍴line])/skip_nl     ⍝ If the line ends in a newline
>> >   line ← line[⍳¯1+⍴line]        ⍝ Remove the newline
>> > skip_nl:
>> >   line ← ⎕UCS line
>> >   Z ← Z⍪ pattern convert_entry¨ (line≠separator) ⊂ line
>> >   →next
>> > end:
>> >
>> >   FIO∆fclose fd
>> > ∇
>> >
>>
>
>

Re: [Bug-apl] Performance problems when constructing large(ish) arrays

Reply via email to