On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> >>>>> "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:
[...]
> this talk about mmap makes little sense to me. it may save some i/o and
> even some buffering but you still need the ram and mmap still causes
> disk accesses.

Um, mmap does not (well, should not - Windows may vary) require
dedicated RAM, other than what the memory manager needs to keep
track of the fact that the mapping has happened; the mapped pages
live in the OS page cache and can be reclaimed under memory
pressure.  Using mmap does not imply any particular algorithm.
It is an optimization that says you're leaving the ugly details
of paging the file to/from your process to the OS.

mmap should not cause any more or fewer disk accesses than
reading the file in the same pattern would.  It just lets you
do things like run Perl's RE engine directly on the file
contents.
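For instance, a sketch using the Sys::Mmap CPAN module (assumed
installed; the file name and contents here are made up, and the
file is generated inline just so the sketch runs as-is):

```perl
use strict;
use warnings;
use Sys::Mmap;   # CPAN module; assumed available for this sketch

# A tiny stand-in file so the sketch is self-contained.
open my $gen, '>', 'big.log' or die $!;
print {$gen} "ERROR disk on fire\nall quiet\nERROR still on fire\n";
close $gen;

open my $fh, '<', 'big.log' or die "open: $!";
my $contents;
mmap($contents, 0, PROT_READ, MAP_SHARED, $fh) or die "mmap: $!";

# The RE engine scans the mapping directly; the OS pages pieces
# of the file in (and back out) on demand, no slurp required.
my $errors = () = $contents =~ /^ERROR\b/mg;
print "$errors error lines\n";

munmap($contents);
close $fh;
```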

>                     if the original file is too big for ram then the
> algorithm chosen must be one to minimize disk accesses and mmap doesn't
> save those. this is why disk/tape sorts were invented, to minimize the
> slow disk and tape accesses. so you would still need my algorithm or
> something similar regardless of how you actually get the data from disk
> to ram. and yes i have used mmap on many projects.

I'm not sure what you mean by "something similar", but yes,
you'll need SOME algorithm to solve the problem - which is a
statement so general as to be meaningless.  I'm sure that
there are some possible algorithms that you'd never have
thought of.  (Mostly because they're bad.)

Disk/tape sorts were invented because back in the day there
was not enough RAM to do anything useful and so everything
had to go to disk.  Of course once you're forced to go to disk,
why not optimize it...?

Of course this problem asked for a guarantee of being able to do
the sort, not necessarily doing it most efficiently.  Therefore no
single criterion - including disk accesses - necessarily MUST
dominate your choice.  Furthermore, disk accesses are not
created equal.  There are multiple levels of cache between
you and the disk, and accessing data in a cache-friendly way
will improve performance greatly.

In particular, managing to access data sequentially is orders of
magnitude faster than jumping around.  The key is not how often
you "access disk", it is how often your hard drive has to do a
seek.  When it does seek, it reads far more data than it was
asked for and puts that in cache, so when you read sequentially,
most of your accesses come from cache, not disk.  That is why
databases use merge sort so much: it accesses data in exactly
the way that hard drives are designed to be accessed most
efficiently.  A quicksort may do fewer disk accesses, but far
more of them cause an unwanted seek.
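The pattern can be sketched as a toy external merge sort in
Perl; the file names and run size are made up, and a small input
is generated inline so the sketch runs as-is (a real run size
would be as many lines as comfortably fit in RAM):

```perl
use strict;
use warnings;

# Tiny self-generated input; in practice it would dwarf RAM.
open my $gen, '>', 'input.txt' or die $!;
print {$gen} sprintf("%05d\n", $_) for reverse 1 .. 300;
close $gen;

my $run_size = 100;   # lines per in-memory run (tiny for the demo)
my @runs;

sub write_run {
    my ($buf, $n) = @_;
    my $name = "run.$n.tmp";
    open my $out, '>', $name or die $!;
    print {$out} sort @$buf;   # the only in-RAM sort
    close $out;
    @$buf = ();
    return $name;
}

# Pass 1: read runs that fit in RAM, sort each, write it out.
open my $in, '<', 'input.txt' or die $!;
my @buf;
while (defined(my $line = <$in>)) {
    push @buf, $line;
    push @runs, write_run(\@buf, scalar @runs) if @buf == $run_size;
}
push @runs, write_run(\@buf, scalar @runs) if @buf;
close $in;

# Pass 2: k-way merge.  Every run file is consumed strictly
# front to back, so almost all accesses are sequential.
my @fh   = map { open my $f, '<', $_ or die $!; $f } @runs;
my @head = map { scalar <$_> } @fh;
open my $out, '>', 'sorted.txt' or die $!;
while (grep { defined } @head) {
    my ($i) = sort { $head[$a] cmp $head[$b] }
              grep { defined $head[$_] } 0 .. $#head;
    print {$out} $head[$i];
    $head[$i] = readline $fh[$i];
}
close $out;
unlink @runs;
```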

> when analyzing algorithm effienciency you must work out which is the
> slowest operation that has the steepest growth curve and work on
> minimizing it. since disk access is so much slower than ram access it
> becomes the key element rather than the classic comparison in sorts. in

You must, must, must.  What is this preoccupation with must?

As I just pointed out, disk accesses are not all equal.

Secondly, in many applications you will *parallelize* the
slowest step, not minimize it.  For instance, good databases
not only like to use merge sort internally, they often distribute
the job across several processes or threads that all work at
once; that way, if one process is waiting on a disk read, others
can keep going at the same time.

Thirdly, and most importantly, it is more important to make
code work than to make it efficient.  If a stupid solution will
work and a smart one should be faster, code the stupid
solution first.

> a matrix transposition in ram, i would count the matrix accesses and/or
> copies of elements. with a larger matrix, then ram accesses would be
> key. my solution would load as much matrix into ram as possible (maybe
> using mmap but that is not critical anymore) and transpose it. then
> write the section out. that is 2 (large) disk accesses per chunk (or 1
> per disk block). then you do a merge (assuming you can access all the
> sction files at one time) which is another disk access per section (or
> block). and one more to write out the final matrix (in row order). so
> that is O((2 + 2) * section_count) disk accesses which isn't too bad.

You said that you want to assume that we can access all section
files at once.  Well suppose that I take a CSV file which is 100
columns by 10 million rows, transpose it, then try to transpose it
again.  Your assumption just broke.  Maybe it would work for the
person with the original problem, maybe not.

Here is the outline of a solution that avoids all such assumptions.

1. Run through the CSV file and output a file with lines of the format:
  $column,$row:$field
You'll need to encode embedded newlines some way, for instance
s/\\/\\\\/g; s/\n/\\n/g; - you may also want to zero-pad the column and
row numbers so that an ASCII-betical sort does
The Right Thing.

2. Sort the intermediate file by whatever technique you want to wind
up with the fields sorted by column and then by row.

3. Run through the sorted file and print the final CSV file.  Note
that you'll have to undo any mangling done in step 1.
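Here is a toy Perl sketch of the three steps on a tiny inline
sample.  Real input would be huge, quoted CSV fields would need a
real parser such as Text::CSV, the in-memory sort in step 2 stands
in for whatever external sort you prefer, and the 6-digit padding
width is illustrative:

```perl
use strict;
use warnings;

# A tiny sample CSV so the sketch runs as-is.
open my $csv, '>', 'in.csv' or die $!;
print {$csv} "a,b,c\nd,e,f\n";
close $csv;

# Step 1: emit one "column,row:field" line per cell, escaping
# backslashes and newlines and zero-padding the indices so a
# plain ASCII-betical sort orders them correctly.
open my $in,  '<', 'in.csv'    or die $!;
open my $mid, '>', 'cells.txt' or die $!;
my $row = 0;
while (my $line = <$in>) {
    chomp $line;
    my $col = 0;
    for my $field (split /,/, $line, -1) {
        $field =~ s/\\/\\\\/g;
        $field =~ s/\n/\\n/g;
        printf {$mid} "%06d,%06d:%s\n", $col++, $row, $field;
    }
    $row++;
}
close $in;
close $mid;

# Step 2: sort by column, then row (in RAM here; use an external
# sort for real data).
open $mid, '<', 'cells.txt' or die $!;
my @cells = sort <$mid>;
close $mid;

# Step 3: walk the sorted cells and print the transposed CSV,
# undoing the escaping from step 1 (left to right, so "\\n"
# decodes to backslash-n and "\n" decodes to a newline).
open my $out, '>', 'out.csv' or die $!;
my $cur = -1;
for my $cell (@cells) {
    chomp $cell;
    my ($c, $r, $f) = $cell =~ /^(\d+),(\d+):(.*)\z/s or die "bad cell";
    $f =~ s/\\(.)/$1 eq 'n' ? "\n" : $1/ge;
    if ($c != $cur) {
        print {$out} "\n" if $cur >= 0;   # start a new output row
        $cur = $c;
    } else {
        print {$out} ',';
    }
    print {$out} $f;
}
print {$out} "\n" if $cur >= 0;
close $out;
unlink 'in.csv', 'cells.txt';
```

Each output row collects one original column, which is exactly
the transpose.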

Done right, this will transpose extremely large files with a
fixed (small) amount of memory.  It probably isn't the fastest
possible solution under normal conditions.  However for
extremely large datasets it is easy to do steps 1 and 3 with
simple programs written in C, and highly optimized
programs are available for step 2 (that do things like
parallelize across a cluster).  That may well beat an
attempt to be clever.

Cheers,
Ben
_______________________________________________
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm
