On Fri, 16 Jun 2006, Jacob Yocom-Piatt wrote:

> i've got some C code that is reading from an 800 MB CSV file and allocates
> memory for an array to store the data in. the method used is to read the
> CSV file line-by-line and realloc additional space with each line read.
> having timed this and found the realloc speed to be low when the array is
> large, i am aiming to make this faster but am not sure about the best way
> to proceed.
> 
> the current code uses realloc in the manner suggested by the manpage:
> 
>     newsize = size + 1;
>     time(&t1);                  // start timing realloc
>     if ((newap = (int *)realloc(ap, newsize*sizeof(int))) == NULL) {
>         free(ap);
>         ap = NULL;
>         size = 0;
>         return (NULL);
>     }
>     ap = newap;                 // on success, switch to the new pointer
>     size = newsize;
>     time(&t2);                  // stop timing realloc; start timing fscanf
> 
> as the size of ap grows, so does the time it takes to realloc the space.
> 
> an alternative to this procedure would be to scan through the CSV file to
> determine how many array entries i would need, realloc it all at once, then go
> back through the CSV file again to read the data into the array. i'm not
> confident this is the only way to do this and would appreciate any suggestions
> for speeding up this procedure.

You'll have to think about how this works: realloc has to copy the data
to a newly allocated region if it is not able to extend the current
region in place.

If you do that every time you grow the allocation and your size
increment is small, you'll do a lot of work. The way you are doing it
now has quadratic time complexity: each of the n reallocations may copy
up to n elements.

So make your increment larger and start with a larger size. Maybe you
can estimate the initial size based on your file size. If you end up
allocating too large an area, just use realloc to decrease the size
after you're done.
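A minimal sketch of that idea (the `intvec` struct and function names are
mine, not from the original code): double the capacity whenever the array
fills, so n appends cost O(n) total copying instead of O(n^2). The starting
capacity is an arbitrary guess; you could seed it from the file size instead.

```c
#include <stdlib.h>

/* Hypothetical growable int array; names are illustrative. */
struct intvec {
    int    *data;
    size_t  len;    /* elements in use */
    size_t  cap;    /* elements allocated */
};

/* Append one value, doubling the buffer only when it is full.
   Returns 0 on success, -1 on allocation failure (old buffer
   stays valid, so the caller can still free or use it). */
int intvec_push(struct intvec *v, int value)
{
    if (v->len == v->cap) {
        size_t newcap = v->cap ? v->cap * 2 : 1024;
        int *p = realloc(v->data, newcap * sizeof(int));
        if (p == NULL)
            return -1;
        v->data = p;
        v->cap = newcap;
    }
    v->data[v->len++] = value;
    return 0;
}
```

Once the file is fully read, one final realloc(v->data, v->len * sizeof(int))
trims the unused tail, as suggested above.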

Also think about other data structures: you might be better off with a
linked list or tree here, depending on what you are doing with your
data.
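For comparison, a minimal singly-linked list sketch (names are mine): each
value gets its own node, so no large buffer is ever copied, at the cost of
per-node malloc overhead and worse cache locality.

```c
#include <stdlib.h>

/* One node per value read from the file. */
struct node {
    int          value;
    struct node *next;
};

/* Prepend a value in O(1). Returns the new head on success, or
   NULL on allocation failure (the old list is left unchanged).
   Note the list ends up in reverse insertion order. */
struct node *list_push(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));
    if (n == NULL)
        return NULL;
    n->value = value;
    n->next = head;
    return n;
}
```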

        -Otto
