>>>>> "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:
GS> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote: >> Seriously, while mmap is ideal in C, in Perl I would just build an array >> of tell()s for each line in the file and then walk through the lines, >> storing the offset of the last delimiter that I'd seen. GS> I think mmap would be just as ideal in Perl and a lot less work too. GS> Rather than indexing and parsing a *large* file, you must mmap GS> and parse it. In fact, the CSV code, which was left as an exercise in you GS> pseudo-code, would be the only code required. GS> I should point out though that mmap has a 2GB limit on systems GS> without 64bit support. Such systems can't store files larger than GS> that anyhow. >> Let the kernel file buffer do your heavy lifting for you. GS> Exactly, if s/kernel file/mmap/ this talk about mmap makes little sense to me. it may save some i/o and even some buffering but you still need the ram and mmap still causes disk accesses. if the original file is too big for ram then the algorithm chosen must be one to minimize disk accesses and mmap doesn't save those. this is why disk/tape sorts were invented, to minimize the slow disk and tape accesses. so you would still need my algorithm or something similar regardless of how you actually get the data from disk to ram. and yes i have used mmap on many projects. when analyzing algorithm effienciency you must work out which is the slowest operation that has the steepest growth curve and work on minimizing it. since disk access is so much slower than ram access it becomes the key element rather than the classic comparison in sorts. in a matrix transposition in ram, i would count the matrix accesses and/or copies of elements. with a larger matrix, then ram accesses would be key. my solution would load as much matrix into ram as possible (maybe using mmap but that is not critical anymore) and transpose it. then write the section out. that is 2 (large) disk accesses per chunk (or 1 per disk block). then you do a merge (assuming you can access all the sction files at one time) which is another disk access per section (or block). and one more to write out the final matrix (in row order). so that is O((2 + 2) * section_count) disk accesses which isn't too bad. uri -- Uri Guttman ------ [EMAIL PROTECTED] -------- http://www.stemsystems.com --Perl Consulting, Stem Development, Systems Architecture, Design and Coding- Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org _______________________________________________ Boston-pm mailing list [EMAIL PROTECTED] http://mail.pm.org/mailman/listinfo/boston-pm

