On Fri, Jun 16, 2006 at 10:14:07PM +0200, Otto Moerbeek wrote:
> On Fri, 16 Jun 2006, Jacob Yocom-Piatt wrote:
> 
> > i've got some C code that is reading from a 800 MB CSV file and allocates 
> > memory
> > for an array to store the data in. the method used is to read the CSV file
> > line-by-line and realloc additional space with each line read. having timed 
> > this
> > and found the realloc speed to be low when the array is large, i am aiming 
> > to
> > make this faster but am not sure about the best way to proceed.
> > 
> > the current code uses realloc in the manner suggested by the manpage:
> > 
> >     newsize = size + 1;
> >     time(&t1);                  // start timing realloc
> >     if ((newap = (int *)realloc(ap, newsize*sizeof(int))) == NULL) {
> >         free(ap);
> >         ap = NULL;
> >         size = 0;
> >         return (NULL);
> >     }
> >     time(&t2);                  // stop timing realloc; start timing fscanf
> > 
> > as the size of ap grows, so does the time it takes to realloc the space.
> > 
> > an alternative to this procedure would be to scan through the CSV file to
> > determine how many array entries i would need, realloc it all at once, then 
> > go
> > back through the CSV file again to read the data into the array. i'm not
> > confident this is the only way to do this and would appreciate any 
> > suggestions
> > for speeding up this procedure.
> 
> You'll have to think how this works: realloc has to copy the data to
> the newly allocated region if it is not able to extend the current
> region. 
> 
> If you do that for each time you'll increase the allocation size and
> your size increment is small, you'll do a lot of work. The way you are
> doing it now has quadratic time complexity.
> 
> So make your increment larger and start with a larger size. Maybe you
> can estimate the initial size based on your file size. If you'll end
> up allocating a too large area, just use realloc the decrease the size
> after you're done.
> 
> Also thing about other data structures: you might be better of with a
> linked list or tree here, depending on what you are doing with your
> data.
> 
>       -Otto

Also, if the purpose of this is to load the file in memory, why not just
mmap() it in the first place ?

Reply via email to