On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:

> >>>>> "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:

[...]

> this talk about mmap makes little sense to me. it may save some i/o and
> even some buffering but you still need the ram and mmap still causes
> disk accesses.
Um, mmap does not (well, should not - Windows may vary) use any RAM other than what the memory manager needs to keep track of the fact that the mapping has happened. Using mmap does not imply any particular algorithm; it is an optimization that leaves the ugly details of paging the file to and from your process to the OS. mmap should cause no more and no fewer disk accesses than reading the file in the same pattern would. It just lets you do things like use Perl's RE engine directly on the file contents.

> if the original file is too big for ram then the algorithm chosen must
> be one to minimize disk accesses and mmap doesn't save those. this is
> why disk/tape sorts were invented, to minimize the slow disk and tape
> accesses. so you would still need my algorithm or something similar
> regardless of how you actually get the data from disk to ram. and yes
> i have used mmap on many projects.

I'm not sure what you mean by "something similar", but yes, you'll need SOME algorithm to solve the problem. Which statement is so general as to be meaningless. I'm sure that there are some possible algorithms that you'd never have thought of. (Mostly because they're bad.)

Disk/tape sorts were invented because back in the day there was not enough RAM to do anything useful, so everything had to go to disk. And of course, once you're forced to go to disk, why not optimize for it? But this problem asked for a guarantee of being able to do the sort, not necessarily the most efficient sort. Therefore no single criterion - including disk accesses - necessarily MUST dominate your choice.

Furthermore, disk accesses are not created equal. There are multiple levels of cache between you and the disk. Accessing data in a way that is friendly to cache will improve performance greatly. In particular, managing to access data sequentially is orders of magnitude faster than jumping around.
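(As a quick aside on the "RE engine directly on the file contents" point: here is a minimal sketch, assuming the CPAN module File::Map is installed. The demo file is created on the fly just so the example is self-contained.)

```perl
use strict;
use warnings;
use File::Map qw(map_file);
use File::Temp qw(tempfile);

# Make a small demo file; in real life this would be a huge log
# that you would not want to slurp into your own buffers.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print $fh "ok\nERROR one\nok\nERROR two\n";
close $fh;

# Map the file; the OS pages data in on demand rather than us
# holding a private copy in RAM.
map_file my $map, $path, '<';

# Perl's RE engine walks the mapped bytes directly - no read() loop.
my $count = () = $map =~ /^ERROR/mg;
print "found $count ERROR lines\n";
```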
The key is not how often you "access disk", it is how often your hard drive has to do a seek. When the drive seeks, it reads far more data than it was asked for and puts it in cache; when you then read sequentially, most of your accesses come from cache, not disk. That is why databases use merge sort so much - it accesses data in exactly the way that hard drives are designed to be accessed most efficiently. A quicksort has fewer disk accesses, but far more of them cause an unwanted seek.

> when analyzing algorithm efficiency you must work out which is the
> slowest operation that has the steepest growth curve and work on
> minimizing it. since disk access is so much slower than ram access it
> becomes the key element rather than the classic comparison in sorts.

You must, must, must. What is this preoccupation with must? As I just pointed out, disk accesses are not all equal. Secondly, in many applications you will *parallelize* the slowest step, not minimize it. For instance, good databases not only like to use merge sort internally, they often distribute the job across several processes or threads that all work at once; that way, while one process is waiting on a disk read, the others can keep going. Thirdly, and most importantly, it is more important to make code work than to make it efficient. If a stupid solution will work and a smart one should be faster, code the stupid solution first.

> in a matrix transposition in ram, i would count the matrix accesses
> and/or copies of elements. with a larger matrix, then ram accesses
> would be key. my solution would load as much matrix into ram as
> possible (maybe using mmap but that is not critical anymore) and
> transpose it. then write the section out. that is 2 (large) disk
> accesses per chunk (or 1 per disk block). then you do a merge
> (assuming you can access all the section files at one time) which is
> another disk access per section (or block). and one more to write out
> the final matrix (in row order).
> so that is O((2 + 2) * section_count) disk accesses which isn't too bad.

You said that you want to assume that we can access all section files at once. Well, suppose that I take a CSV file which is 100 columns by 10 million rows, transpose it, then try to transpose it again. Your assumption just broke. Maybe it would work for the person with the original problem, maybe not.

Here is the outline of a solution that avoids all such assumptions.

1. Run through the CSV file and output a file of lines in the format:

     $column,$row:$field

   You'll need to encode embedded newlines some way, for instance

     s/\\/\\\\/g; s/\n/\\n/g;

   You may also want to pre-pad the columns and rows with some number of
   0's so that an ASCII-betical sort does The Right Thing.

2. Sort the intermediate file by whatever technique you want, to wind up
   with the fields sorted by column and then by row.

3. Run through the sorted file and print the final CSV file. Note that
   you'll have to undo any mangling done in step 1.

Done right, this will sort extremely large files with a fixed (small) amount of memory. It probably isn't the fastest possible solution under normal conditions. However, for extremely large datasets it is easy to do steps 1 and 3 with simple programs written in C, and highly optimized programs are available for step 2 (that do things like parallelize across a cluster). That may well beat an attempt to be clever.

Cheers,
Ben

_______________________________________________
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm
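P.S. For the archives, a toy in-memory Perl sketch of steps 1 and 3 above. A real run would write the decorated lines to disk and let an external sort handle step 2, and would need the newline escaping mentioned in step 1; this version assumes fields with no embedded newlines or commas.

```perl
use strict;
use warnings;

my @csv = ( "a,b,c", "d,e,f" );    # 2 rows x 3 columns

# Step 1: decorate every field as "col,row:field", zero-padding the
# indices so that a plain ASCII-betical sort orders them numerically.
my @decorated;
my $row = 0;
for my $line (@csv) {
    my $col = 0;
    push @decorated, sprintf( "%06d,%06d:%s", $col++, $row, $_ )
        for split /,/, $line;
    $row++;
}

# Step 2: in memory here; externally this is where sort(1) would run.
@decorated = sort @decorated;

# Step 3: the fields are now grouped by column, in row order, so stream
# through and emit each group as one line of the transposed CSV.
my @out;
my $cur = -1;
for (@decorated) {
    my ( $col, $r, $field ) = /^(\d+),(\d+):(.*)\z/s;
    push @out, [] if $col != $cur;
    $cur = $col;
    push @{ $out[-1] }, $field;
}
print join( ',', @$_ ), "\n" for @out;   # a,d then b,e then c,f
```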

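P.P.S. Since the seek argument earlier leans on merging, here is a toy k-way merge in Perl that reads each pre-sorted run strictly sequentially - which is exactly what keeps a disk-based version from seeking. Real external sorts do the same with runs on disk, usually via a heap rather than this linear scan.

```perl
use strict;
use warnings;

# Merge any number of already-sorted runs (arrayrefs), consuming each
# one front to back - i.e. strictly sequentially.
sub merge_runs {
    my @runs = @_;
    my @idx  = (0) x @runs;    # read cursor per run
    my @out;
    while (1) {
        my $best;
        for my $r ( 0 .. $#runs ) {
            next if $idx[$r] > $#{ $runs[$r] };
            $best = $r
                if !defined $best
                || $runs[$r][ $idx[$r] ] lt $runs[$best][ $idx[$best] ];
        }
        last unless defined $best;    # every run exhausted
        push @out, $runs[$best][ $idx[$best]++ ];
    }
    return @out;
}

my @merged = merge_runs( [qw(a d e)], [qw(b c f)] );
print "@merged\n";    # a b c d e f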
