>>>>> "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:

  GS> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
  >> Seriously, while mmap is ideal in C, in Perl I would just build an array
  >> of tell()s for each line in the file and then walk through the lines,
  >> storing the offset of the last delimiter that I'd seen.

  GS> I think mmap would be just as ideal in Perl and a lot less work too.
  GS> Rather than indexing and parsing a *large* file, you must mmap
  GS> and parse it. In fact, the CSV code, which was left as an exercise in you
  GS> pseudo-code, would be the only code required.

  GS> I should point out though that mmap has a 2GB limit on systems
  GS> without 64bit support. Such systems can't store files larger than
  GS> that anyhow.

  >> Let the kernel file buffer do your heavy lifting for you.

  GS> Exactly, if s/kernel file/mmap/

this talk about mmap makes little sense to me. it may save some i/o and
even some buffering but you still need the ram and mmap still causes
disk accesses. if the original file is too big for ram then the
algorithm chosen must be one to minimize disk accesses and mmap doesn't
save those. this is why disk/tape sorts were invented, to minimize the
slow disk and tape accesses. so you would still need my algorithm or
something similar regardless of how you actually get the data from disk
to ram. and yes i have used mmap on many projects.

when analyzing algorithm effienciency you must work out which is the
slowest operation that has the steepest growth curve and work on
minimizing it. since disk access is so much slower than ram access it
becomes the key element rather than the classic comparison in sorts. in
a matrix transposition in ram, i would count the matrix accesses and/or
copies of elements. with a larger matrix, then ram accesses would be
key. my solution would load as much matrix into ram as possible (maybe
using mmap but that is not critical anymore) and transpose it. then
write the section out. that is 2 (large) disk accesses per chunk (or 1
per disk block). then you do a merge (assuming you can access all the
sction files at one time) which is another disk access per section (or
block). and one more to write out the final matrix (in row order). so
that is O((2 + 2) * section_count) disk accesses which isn't too bad.

uri

-- 
Uri Guttman  ------  [EMAIL PROTECTED]  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org
_______________________________________________
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm

Reply via email to