Re: [Boston.pm] transposing rows and columns in a CSV file

Ben Tilly Mon, 15 Nov 2004 15:41:33 -0800

On Mon, 15 Nov 2004 15:58:11 -0500, Aaron Sherman
<[EMAIL PROTECTED]> wrote:
> On Sat, 13 Nov 2004 11:40:25 -0800, Ben Tilly <[EMAIL PROTECTED]> wrote:
> > On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> wrote:
> > > On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote:
> > [...]
> > > > Um, mmap does not (well should not - Windows may vary) use any
> > > > RAM
> 
> > > You are confusing two issues. "using RAM" is not the same as "allocating
> > > process address space".
> 
> > How was I confusing issues?
> 
> Let me demonstrate:
>
> > What I meant is that calling mmap does
> > not use significant amounts of RAM.
> 
> Calling mmap uses NO RAM. It doesn't interact with RAM at all. But
> does allocate (potentially huge) amounts of process address space, and
> reserves it in such a way that your process can no longer allocate it
> for uses like libc's memory allocator (which you access through
> functions like malloc).


I feel like we are all talking past each other.  Let's go back to basics.

When I say RAM I mean the physical RAM on the computer, whether
or not that RAM is currently allocated to your process.  So if you
do something and that makes something else get paged out, then
you've used RAM in my view.  Whether that RAM is in pages that are
attached to your process or was used by the kernel, I still see
that as you using RAM.

> If you mmap a 3GB file (actually less than 3GB, but I'll use that
> number as an example for now) on an x86 linux box and then call
> "malloc", you get back a NULL pointer because malloc will fail. This
> is actually not quite true. That malloc will likely work because it
> will be allocated from some existing page of address space that
> malloc's internal page allocator reserved before you called mmap, but
> that won't work for long.

I'm aware of this and wasn't disputing it.

> >  (The OS needs some to track
> > that the mapping exists, but that should be it.)
> 
> Actually, no. The place that mmap is tracked is a) in the file
> descriptor table, which is outside of your 3GB process space in
> kernel-space and b) in the system page table, which is not in your
> address space at all, but in hardware.

Where it is tracked doesn't concern me.  That it is tracked, does.

However I realize that I don't know enough about how the memory
management is handled.  I would think that this would be dynamic
in some way - on creating a process the kernel should need to
write very little data, but will then write more later.  But I don't know
enough to verify that one way or the other.

> > Once you actually use
> > the data that you mmapped in, file contents will be swapped in, and
> > RAM will be taken, but not until then.
> 
> "RAM will be taken" is a meaningless term here. Ignore RAM for
> purposes of this conversation.

On the one hand you are saying that I'm confused about what I
meant by a comment about using RAM, and on the other you
are telling me that I am to ignore RAM for the purposes of this
conversation, it is meaningless.  There is a contradiction there.
For discussing what *you* want to talk about it may be
meaningless, but for discussing what *I* had been talking
about it isn't.  And for deciding whether or not I was confused
it most definitely isn't.  (Perhaps you're confused about what I
was talking about?)

It appears that you want to discuss what the world looks like
to a process.  For that I wholeheartedly agree that talking about
what is in RAM is generally counterproductive.  If the
abstraction of virtual memory works, then you should never
know or care about what is or is not in RAM.

But I was talking about what things lead to resource
consumption that could adversely affect a machine which is
carrying out a particular computation.  For that it matters a
great deal whether particular operations are going to cause
pages of RAM to be discarded and allocated for something
else.  Because when it comes to actual performance, the
abstraction of virtual memory leaks badly.

(And what I was saying about resource consumption is that
mmap doesn't consume resources in meaningful amounts.)

> > As for a 3 GB limit, now that you mention it, I heard something
> > about that.  But I didn't pay attention since I don't need it right now.
> 
> Suffice to say that your process cannot be larger than 3GB under x86
> Linux. There are extensions, options and hacks if you want to go
> larger, but after 3GB it gets very dicey.

Are you saying that Linux does not give a user-level API to Intel's
addressing extensions?  Or that it does but you recommend
against using it?

> > I've also heard about Intel's large addressing extensions (keep 2GB
> > in normal address space, page around the top 2 GB, you get 64 GB
> > of addressable memory).  I'm curious about how (or if) the two can
> > cooperate.
> 
> The ability to re-map memory like this is quite common, and the *OS*
> can take advantage of it, but as long as you're on an x86 and using
> 32-bit pointers, your one process can still only have 3GB of address
> space (4GB-1GB for system area). But it could, for example, talk to
> several other processes which each have their own 3GB address spaces
> and pass around shared memory segments on-demand.

I know that the OS can (and many do) take advantage of Intel's
Physical Address Extension (PAE).  However there is no reason
that the OS can't offer a user-level API to it.  For instance Microsoft
does under the name Address Windowing Extensions (AWE).  See
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/address_windowing_extensions.asp

What I was wondering is whether Linux has or plans to have any
such user-level API, and if they do then whether there will be a
conflict between that API and the offer of 3 GB of addressable
memory per process.

> > However the over-committed allocation comment confuses me.
> > Why would a single mmap result in over committing memory?
> 
> Ignore the over-commit comment, it's kind of confusing to the core
> point. Assume you always have that option set and this conversation is
> easier to have.

I'll ignore it.  However before doing that I'll comment that in
this discussion I was thinking of loading a file by using a mmap
with MAP_SHARED.  If you load one with MAP_PRIVATE then I can
easily see where overcommitting comes in.

> > > Actually, no it doesn't as far as I know (unless the copy-on-write code
> > > got MUCH better recently).
> >
> > Where does a write happen?  I was thinking in terms of using the
> > RE engine (with pos) as a tokenizer.
> 
> I understand. I was not talking about writes. Copy-on-write is only
> useful if you DON'T write to most of the data in question (and thus,
> you never have to copy it). The point was that unless Perl's CoW has
> gotten much better since I looked, matching against the data in an mmap
> segment will require that at some point, you copy it. Please, someone
> correct me if that's not current.

I think I see your point.  If you use capturing groups, Perl copies
the captured data.  Ditto for $', $& and $` if those are being
populated.  (Hence the performance problems with ever using
them.)  So over walking the entire file, you will wind up copying
the whole file, token by token.

I'd be amazed if that wasn't still the case.

However even if that copy is still there, you've still saved
copying from the file into your memory space.  So even though
you could save more, you've still saved something.  (And, as I
said originally, you actually can run the RE engine directly
against a file's contents.)
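
To make the tokenizer idea concrete, here is a minimal sketch in
plain Perl.  It is an illustration only: the tokenize() helper and
the CSV-ish token rules are made up for this example, and an
ordinary in-memory string stands in for the mmapped data (getting a
file mapped into a scalar needs a non-core module, which I am not
assuming here).  The loop uses \G with /gc so that pos() remembers
where the last token ended, and each capture into $1 is where the
token-by-token copy happens.

```perl
use strict;
use warnings;

# Hypothetical helper: walk $$text_ref with \G and /gc so that pos()
# tracks where the previous token ended.  The string is passed by
# reference, so the input as a whole is not copied by the call; only
# each captured token is.
sub tokenize {
    my ($text_ref) = @_;
    my @tokens;
    pos($$text_ref) = 0;
    while (pos($$text_ref) < length $$text_ref) {
        if ($$text_ref =~ /\G([^,\n]+)/gc) {
            push @tokens, [FIELD => $1];    # $1 is a copy of the match
        }
        elsif ($$text_ref =~ /\G,/gc) {
            push @tokens, ['COMMA'];
        }
        elsif ($$text_ref =~ /\G\n/gc) {
            push @tokens, ['NEWLINE'];
        }
        else {
            # Defensive only: the three patterns cover every character.
            die "unexpected character at offset " . pos($$text_ref);
        }
    }
    return \@tokens;
}

my $data   = "a,b\nc,d\n";    # stands in for the mmapped file contents
my $tokens = tokenize(\$data);
print join(' ',
    map { $_->[0] eq 'FIELD' ? "FIELD($_->[1])" : $_->[0] } @$tokens), "\n";
```

The /c modifier matters here: without it a failed match would reset
pos() and the next alternative would start over from the beginning
of the string.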

> > > Like I said, you probably won't get the win out of mmap in Perl that you
> > > would expect. In Parrot you would, but that's another story.

My expectations appear to be lower than yours. :-)

> > In Perl I'd expect it to be possible but fragile.  If Parrot could make
> > it possible and not fragile, that would be great.
> 
> In parrot it's quite robust. Parrot supports "buffers" as core PMC
> types. A buffer can refer to any part of memory with any read-only or
> copy-on-write semantics you like.

That would be nice.

Incidentally will Parrot also support efficiently building strings
incrementally?  I like the fact that in Perl 5 it is O($n) to do
something like:

  $string .= "hello" for 1..$n;

In most other languages that is quadratic, and I'm wondering
what to expect in Perl 6.
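
A rough way to check the linear claim, assuming only the core
Time::HiRes module; the build() helper is a name invented for this
sketch.  If appending is linear, doubling $n should roughly double
the elapsed time rather than quadruple it.  Actual timings will vary
by machine and perl build, so I am not quoting any numbers.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: append "hello" $n times, then return the final
# length and the elapsed wall-clock seconds.
sub build {
    my ($n) = @_;
    my $string = '';
    my $start  = Time::HiRes::time();
    $string .= "hello" for 1 .. $n;
    return (length($string), Time::HiRes::time() - $start);
}

for my $n (100_000, 200_000, 400_000) {
    my ($len, $secs) = build($n);
    printf "%7d appends -> %8d bytes in %.4f s\n", $n, $len, $secs;
}
```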

Cheers,
Ben
_______________________________________________
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm
