On Mon, 2008-03-31 at 13:37 +0200, Roland Mainz wrote:
> Rod Evans wrote:
> >
> > I'm sponsoring the following case for Mike Corcoran. Time out 04/07/08.
> >
> > The case introduces a new system call, mmapfd(2). This call is primarily
> > targeted for use by ld.so.1(1), and provides for the efficient mapping of
> > ELF files (and 4.x AOUT files).
> >
> > Release Binding: Patch/Micro
> > mmapfd(2): Consolidation Private
> >
> > --------------------------------------------------------------------------
> >
> > 1. Introduction
> > 1.1. Project/Component Working Name:
> > mmapfd: mmap file descriptor
> [snip]
> > 4. Technical Description:
> > 4.1. Details:
> > mmapfd is a new system call which can interpret and map ELF and AOUT
> > (4.x) objects. This system call allows the interpretation and mapping
> > of ELF and AOUT files to be carried out completely by the kernel
> > rather than by ld.so.1.
>
> Erm... the call seems to be ELF+AOUT-specific - why does it have such a
> generic name ?
It has a generic name for future expansion.
> The call can't handle other types of executables (e.g.
> "javaexec" or "shbinexec")
Not yet, but that doesn't mean that it won't. If there is a file format
that needs to be interpreted in order to be mapped, this new syscall is
a good way of going about doing that.
> nor does it seem to be useful to map normal
> files...
Also not yet. You can map a whole file read-only, and we've talked about
allowing read/write mappings, but I don't see why someone would use this
interface over mmap(2) for files that don't need interpretation:
mmapfd(2) only deals with whole files, while mmap(2) lets you map a range
within a file. I went down the path of implementing just that, but right
now it looks like a feature no one would use, so I don't want to commit
to an interface for it. A future phase of this project can address that
issue.
> ... what about renaming the call to |mmapexecfd()| (= "memory map of
> executable fd") ?
>
I like Darren's suggestion of mmapv(2) better so far. Why limit us to
only being able to interpret executable files? What if there's some
other non-executable file type that would naturally use this interface?
> > mmapfd also provides for mapping a whole file, without interpretation
> > in a read only mode.
>
> What does that mean ? Can these data+MMU mappings be shared between
> processes ?
>
It means that you can map a file read-only without doing any
interpretation of the file. Thus if you passed in an ELF file without
the MMFD_INTERPRET flag, the file would get mapped as a single read-only
segment. The data+MMU mappings would not be shared between processes.
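For comparison, the default (non-interpreted) mode gives roughly the same result as the mmap(2) sequence below: one private, read-only mapping of the whole file. This is only a sketch using existing interfaces, not how mmapfd() is implemented, and map_whole_file_ro() is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: map all of fd as one private, read-only mapping.
 * On success returns the mapping address and stores its length in *lenp.
 * On failure returns MAP_FAILED. An empty file is treated as a failure,
 * since mmap(2) rejects zero-length mappings.
 */
static void *
map_whole_file_ro(int fd, size_t *lenp)
{
	struct stat st;
	void *addr;

	assert(lenp != NULL);

	if (fstat(fd, &st) != 0) {
		perror("fstat");
		return (MAP_FAILED);
	}
	if (st.st_size <= 0)
		return (MAP_FAILED);

	*lenp = (size_t)st.st_size;
	addr = mmap(NULL, *lenp, PROT_READ, MAP_PRIVATE, fd, 0);
	if (addr == MAP_FAILED)
		perror("mmap");
	return (addr);
}
```

The caller releases the mapping with munmap(addr, len) when done.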
> [snip]
> > System Calls mmapfd(2)
> >
> > NAME
> >
> > mmapfd - map a file descriptor in the appropriate manner.
> >
> > SYNOPSIS
> >
> > #include <sys/mman.h>
> >
> > int
> > mmapfd(int fd, uint_t flags, mmapfd_result_t *storage,
> > uint_t *elements, void *arg)
>
> Uhm... how do I unmap the mapping done by |mmapfd()| ?
>
Nico answered this in a subsequent post: you would use munmap(2) on
each element in the "storage" array to unmap that segment. I agree that
this should be explicitly pointed out somewhere, since it's not clear.
> > DESCRIPTION
> >
> > The mmapfd() function establishes a set of mappings between a process's
> > address space and a file. By default, mmapfd maps the whole file as a
> > single, private, read-only mapping. The MMFD_INTERPRET flag instructs
> > mmapfd to attempt to interpret the file and map it according to the
> > rules for that file format. Currently only the following ELF and AOUT
> > formats are supported.
>
> What will happen if the file is executable but uses an unsupported
> format (e.g. "javaexec", "shbinexec" etc.) ?
>
If the MMFD_INTERPRET flag is specified and the file is not ELF or AOUT
(e.g. "javaexec", "shbinexec", ...), then EINVAL will be returned, as
specified later in the man page. If the MMFD_INTERPRET flag is not set,
the file will get mapped as a single read-only segment.
> > ET_EXEC and AOUT executables
> > Result in one or more mappings whose size, alignment and protections
> > are as described by the file's program header information. The address
> > of each mapping is explicitly defined by the file's program headers.
> >
> > ET_DYN and AOUT shared objects
> > Result in one or more mappings whose size, alignment and protections
> > are as described by the file's program header information. The base
> > address of the initial mapping is obtained by mmapfd(). The addresses
> > of adjacent mappings are based on this base address as explicitly
> > defined by the file's program headers.
> >
> > ET_REL and ET_CORE
> > Result in a single, read-only mapping. The base address of this
> > mapping is obtained by mmapfd().
> >
> > mmapfd will not map over any currently used mappings within the process
> > except for the case of an ELF file for which a previous reservation has
> > been made via /dev/null.
> >
> > PARAMETERS
> >
> > fd The open file descriptor for the file to be mapped.
> >
> > flags Indicates that the default behavior of mmapfd should be modified
> > accordingly. Available flags are MMFD_INTERPRET and MMFD_PADDING.
> >
> > storage
> > A pointer to the mmapfd_result_t array where the mapping
> > data will be copied out after a successful mapping of fd.
> >
> > elements
> > A pointer to the number of mmapfd_result_t elements pointed to by
> > storage. On return, elements contains the number of mappings required
> > to fully map the requested object. If the original value of
> > elements was too small, an error will be returned, and elements
> > will be modified to contain the number of mappings necessary.
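In practice a caller would probably size the array with a guess and retry once if it was too small. Here is a minimal sketch of that pattern, assuming a failed call returns nonzero and leaves the required count in elements as described above. The function-pointer type follows the draft synopsis with standard C types; map_with_retry() and stub_mmapfd() are hypothetical, and the stub only stands in for the real call so the sketch can run.

```c
#include <assert.h>
#include <stdlib.h>

/* Mirrors the draft mmapfd_result_t using standard C types. */
typedef struct {
	char		*mr_addr;
	size_t		mr_msize;
	size_t		mr_fsize;
	size_t		mr_offset;
	int		mr_prot;
	unsigned int	mr_flags;
} mmapfd_result_t;

/* Shape of the draft mmapfd() synopsis. */
typedef int (*mmapfd_fn)(int fd, unsigned int flags,
    mmapfd_result_t *storage, unsigned int *elements, void *arg);

/* Stand-in for the real call: pretends the object needs 6 mappings. */
static int
stub_mmapfd(int fd, unsigned int flags, mmapfd_result_t *storage,
    unsigned int *elements, void *arg)
{
	const unsigned int needed = 6;
	unsigned int i;

	(void) fd; (void) flags; (void) arg;
	if (*elements < needed) {
		*elements = needed;	/* report the required count */
		return (-1);
	}
	for (i = 0; i < needed; i++)
		storage[i] = (mmapfd_result_t){ 0 };
	*elements = needed;
	return (0);
}

/*
 * Hypothetical wrapper: try with a small array, and if the call fails
 * while asking for more elements, grow the array once and retry.
 * Returns a malloc'd array (caller frees) and sets *countp, or NULL.
 */
static mmapfd_result_t *
map_with_retry(mmapfd_fn do_mmapfd, int fd, unsigned int flags, void *arg,
    unsigned int *countp)
{
	unsigned int have = 4;		/* arbitrary initial guess */
	unsigned int need = have;
	mmapfd_result_t *storage, *bigger;

	assert(do_mmapfd != NULL && countp != NULL);

	storage = calloc(have, sizeof (*storage));
	if (storage == NULL)
		return (NULL);
	if (do_mmapfd(fd, flags, storage, &need, arg) == 0) {
		*countp = need;
		return (storage);
	}
	if (need <= have) {		/* failed for some other reason */
		free(storage);
		return (NULL);
	}
	bigger = realloc(storage, need * sizeof (*bigger));
	if (bigger == NULL) {
		free(storage);
		return (NULL);
	}
	storage = bigger;
	if (do_mmapfd(fd, flags, storage, &need, arg) != 0) {
		free(storage);
		return (NULL);
	}
	*countp = need;
	return (storage);
}
```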
> >
> > arg A pointer to additional information that might be associated with
> > the specific request. Presently, only the MMFD_PADDING request uses
> > this argument. In this case, arg should be a pointer to size_t that
> > indicates how much padding is requested. This amount of padding is
> > added before the first mapping and immediately after the last
> > mapping.
> >
> > FLAGS
> >
> > MMFD_INTERPRET
> > Interpret the contents of the file descriptor instead of just
> > mapping a single image. Can only be used with ELF and AOUT files.
> >
> > MMFD_PADDING
> > When mapping in the file descriptor, padding of the amount pointed
> > to by arg is requested before the lowest mapping and after the
> > highest mapping.
> >
> > TYPES USED
> >
> > typedef struct {
> > caddr_t mr_addr; /* mapping address */
> > size_t mr_msize; /* mapping size */
> > size_t mr_fsize; /* file size */
> > size_t mr_offset; /* offset into file */
> > int mr_prot; /* the protections provided */
>
> Why is this |signed int| ?
I agree it should be uint_t or even uchar_t. Both appear to be commonly
used throughout the kernel. uint_t is my preference here.
>
> > uint_t mr_flags; /* info on the mapping */
>
> Please change this to |uint64_t| (e.g. it may be nice to have more flags
> available by default).
>
I go back and forth on this. uint64_t implies that 32 flags will not be
enough. 32 seems like a lot but since this interface should be around
for a long time, maybe 64 would be better since it will take a long time
to use all 64 :)
> > } mmapfd_result_t;
> >
> > Values for mr_flags include:
> >
> > MFD_ELF_HDR 0x1 /* the ELF header is mapped at mr_addr */
> > MFD_AOUT_HDR 0x2 /* the AOUT header is mapped at mr_addr */
>
> What about reserving the first four bits for |MFD_*_HDR| flags
> (|MFD_ELF_HDR|, |MFD_AOUT_HDR| and two bits reserved for future
> |MFD_*_HDR| flags) ?
>
I can see the vanity reason for doing this: it's nice to clump things in
groups of 4 with everything in the group being related. But at the same
time, why the first 4 flags and not the first 8 or 16 or ...? I think
densely packing the bits is easiest. I'm willing to move MFD_PADDING
first in the list (i.e. 0x1) so that future header types will follow
numerically from the current header types. I'll make this change since
it is aesthetically more pleasing.
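To illustrate how a consumer might use mr_flags, here is a minimal sketch that scans the result array for the element carrying the file header. It uses the values from the draft man page (0x1 and 0x2) and treats them as independent bits. find_header() is a hypothetical helper, and the numeric values would shift if MFD_PADDING moves to 0x1.

```c
#include <assert.h>
#include <stddef.h>

/* Values as given in the draft man page; subject to renumbering. */
#define	MFD_ELF_HDR	0x1	/* the ELF header is mapped at mr_addr */
#define	MFD_AOUT_HDR	0x2	/* the AOUT header is mapped at mr_addr */

/* Mirrors the draft mmapfd_result_t using standard C types. */
typedef struct {
	char		*mr_addr;
	size_t		mr_msize;
	size_t		mr_fsize;
	size_t		mr_offset;
	int		mr_prot;
	unsigned int	mr_flags;
} mmapfd_result_t;

/*
 * Hypothetical helper: return the first element whose mapping contains
 * an ELF or AOUT header, or NULL if no element is flagged that way.
 */
static const mmapfd_result_t *
find_header(const mmapfd_result_t *storage, unsigned int elements)
{
	unsigned int i;

	assert(storage != NULL || elements == 0);

	for (i = 0; i < elements; i++) {
		if (storage[i].mr_flags & (MFD_ELF_HDR | MFD_AOUT_HDR))
			return (&storage[i]);
	}
	return (NULL);
}
```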
> Finally... how can I _force_ the mapping to use something like 64k pages
> by default ?
Using mpss.so.1 or ppgsz are ways to get the heap segment to use a
specific page size. Other than that, there is no control over the page
size used by the other segments, and the kernel will pick the best size
for the given platform.
Thanks for the comments,
Mike
>
> ----
>
> Bye,
> Roland
>