Andi Kleen wrote:
[Sorry for the late answer.]


No problem; remember, I'm supposed to be on vacation anyway. :-)

Let's start off with at least one thing we can agree on.  If xattrs
are already part of XFS, then it seems reasonable to use an extended
attribute to mark certain files as non-migratable.   (Some further
thought is going to be required here -- r/o sections of a
shared library need not be migrated, but r/w sections containing
program or thread private data would need to be migrated.  So
the extended attribute may be a little more complicated than
just "don't migrate".)

The fact that NFS doesn't support this means that we will have to
have some other way to handle files from NFS though.  It is possible
we can live with the notion that files mapped in from NFS are always
migratable.  (I'll need to look into that some more).

On Tue, Feb 15, 2005 at 09:44:41PM -0600, Ray Bryant wrote:


Sorry, but the only real difference between your API and mbind is that
yours has a pid argument.



OK, so I've "lost the thread" a little bit here. Specifically, what would you propose the API for page migration be? As I read through your note, I see a couple of different possibilities being considered:

(1)  Map each object to be migrated into a management process,
     update the object's memory policy to match the new node locations
     and then unmap the object.  Use the MPOL_F_STRICT argument to mbind() and
     the result is that migration happens as part of the call.

(2)  Something along the lines of:

     page_migrate(pid, old_node, new_node);

     or perhaps

     page_migrate(pid, old_node_mask, new_node_mask);

or

(3)  mbind() with a pid argument?

I'm sorry to be so confused, but could you briefly describe what
your proposed API would be (or choose from the above list if I
have guessed correctly)?  :-)
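
(For reference, here is roughly how I am reading the three
alternatives.  These prototypes are only my sketch of what I think
you might mean, not proposals; mbind_pid() in particular is just a
made-up name for option 3.)

    /* (1) remap the object into a management process and rebind it in
     *     place; migration is assumed to happen as a side effect */
    mbind(addr, len, MPOL_BIND, new_nodemask, maxnode, MPOL_F_STRICT);

    /* (2) an explicit migration call, per node pair or per node mask */
    page_migrate(pid, old_node, new_node);
    page_migrate(pid, old_node_mask, new_node_mask);

    /* (3) mbind() with a pid, i.e. bind (and migrate) another
     *     process's address range */
    mbind_pid(pid, addr, len, MPOL_BIND, new_nodemask, maxnode, MPOL_F_STRICT);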



The fundamental disconnect here is that I think that very few
programs use the NUMA API, and you think that most programs do.


All programs use NUMA policy (assuming you have a CONFIG_NUMA kernel). Internally it's all the same.

Well, yes, I guess to be more precise I should have said that very few programs use any NUMA policy other than the DEFAULT policy, and that they instead make page placement decisions implicitly using first touch.


Hmm, I see perhaps my distinction of these cases with programs already using NUMA API and not using it was not very useful and led you off on a tangent. Perhaps we can just drop it.

I think one problem you have is that you essentially
want to keep the DEFAULT policy, but change the nodes.

Yes, that is correct. This has been exactly my point from the beginning.

We have programs that use the DEFAULT policy and do placement
by first touch, and we want to migrate  those programs without
requiring them to create a non-DEFAULT policy of some kind.

NUMA API currently doesn't offer a way to do that, not even with Steve's patch that does simple page migration.
You only get a migration when you set a BIND or PREFERRED
policy, and then it would stay. But I guess you could
force that and then set back DEFAULT. It's a bit ugly,
but not too bad.



Very ugly, I think. Particularly if you have to do a lot of vma splitting to get the correct node placement. (Worst case is a VMA whose pages are interleaved by first touch across a set of nodes in a way that doesn't match the INTERLEAVE mempolicy. Then you would have to create a separate VMA for each page and use the BIND policy. Then after migration you would have to go through and set the policy back to DEFAULT, resulting in a lot of vma merges.)
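
(Spelled out, the force-BIND-then-back-to-DEFAULT trick looks roughly
like the sketch below.  node_of_page() and target_node() are made-up
stand-ins for however we learn where a page lives now and where it
should go, and vma_start/vma_end/page_size are assumed to be known.)

    unsigned long addr, mask;

    /* one-page BINDs force the migration but split the vma each time */
    for (addr = vma_start; addr < vma_end; addr += page_size) {
            mask = 1UL << target_node(node_of_page(pid, addr));
            mbind((void *)addr, page_size, MPOL_BIND, &mask,
                  sizeof(mask) * 8, MPOL_F_STRICT);
    }

    /* second pass to put DEFAULT back, merging the vmas again */
    mbind((void *)vma_start, vma_end - vma_start, MPOL_DEFAULT, NULL, 0, 0);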



Sure, but NUMA API goes to great pains to handle such programs.


Yes, it does. But how do we handle legacy NUMA codes that people use today on our Linux 2.4.21-based Altix kernels? Such programs don't have access to the NUMA API, so they aren't using it. They work fine on 2.6 with the DEFAULT memory policy. It seems unreasonable to go back and require these programs to use "numactl" to solve a problem that they are already solving without it. And it certainly seems difficult to require them to use numactl to enable migration of those programs.

(I'm sorry to keep harping on this, but I think this is the
heart of the issue we are discussing.  Are you of the opinion that
we should require every program that runs on Altix under Linux 2.6 to use
numactl?)

So let's go with the idea of dropping the va_start and va_end arguments from
the system call I proposed.  Then, we get into the kernel and starting ...


That would make the node array infinite, wouldn't it?  What happens when
you want to migrate a 1TB process? @) I think you have to replace
that one with a single target node argument too.


I'm sorry, I don't follow that at all. The node array has nothing to do with the size of the address range to be migrated. It is not the case that the ith entry in the node array says what to do with the ith page. Instead, the old and new node arrays define a mapping between nodes: for pages found on old_node[i], move them to new_node[i]. The count field is the size of those arrays, not the size of the region being migrated.
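
(So conceptually, for each page the scan encounters, the kernel side
would do something like the following.  page_to_nid() and
migrate_page_to_node() here are just stand-ins for whatever the real
implementation ends up using.)

    int i, cur = page_to_nid(page);

    /* old_node_list/new_node_list define a node-to-node mapping; count
     * is the length of those lists, not the size of the region */
    for (i = 0; i < count; i++) {
            if (cur == old_node_list[i]) {
                    migrate_page_to_node(page, new_node_list[i]);
                    break;
            }
    }
    /* a page on a node not in old_node_list is left where it is */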


You and Robin mentioned some problems with "double migration"
with that, but it's still not completely clear to me what
problem you're solving here. Perhaps that needs to be reexamined.


I think the issue here is basically a scalability issue. The problem we have with most of the proposals that suggest passing in just a single source node and destination node, e.g.:

sys_page_migrate(pid, old_node, new_node);

is that this is basically an O(N**2) operation if N is the number of
processes in the job.  How do we get that?  Well, we have to make the
above system call M times for each process, where M is the number of
nodes being migrated.  In our environment, we typically pin each process
to a different processor.  So we have to make M*N system calls.  But
we have two processors per node, so M is roughly N/2, and M*N is
roughly N**2/2 system calls; hence it is O(N**2).

(To simplify the discussion, I am making an implicit assumption
that the majority of the memory involved here is shared among all
N processes.)
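
(Written out, the driving loop at user level looks something like the
sketch below; pids[], old[], and new[] just hold the N process ids and
the M source/destination node pairs.)

    /* M*N calls, and every call rescans that process's page tables */
    for (i = 0; i < N; i++)
            for (j = 0; j < M; j++)
                    sys_page_migrate(pids[i], old[j], new[j]);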

Now, once a particular shared object has been migrated, the migration
code won't do any additional work, but we will still scan over the page
table entries once per system call, so that is the component that
is done O(N**2) times.  It is likely a much smaller component of the
system call than the migration code, which is O(p) in the number of
pages p, but that N**2 factor is a little scary when you have N=128
and the number of pages is large.

So what we mean by the phrase "double migration" is the process
of scanning the page tables a second time to try to migrate an
object that a previous call has already moved the underlying pages
for.

Compare this to the system call:

sys_page_migrate(pid, count, old_node_list, new_node_list);

We are then down to O(N) system calls and O(N) page table scans.
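
That is, the driving loop collapses to one call (and one page table
scan) per process:

    /* N calls, one page table scan each */
    for (i = 0; i < N; i++)
            sys_page_migrate(pids[i], count, old_node_list, new_node_list);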

But we can do better than that.  If  we have a system  call
of the form

sys_page_migrate(pid, va_start, va_end, count, old_node_list, new_node_list);

and the majority of the memory is shared, then we only need to make
one system call and one page table scan. (We just "migrate" the
shared object once.) So the time to do the page table scans disappears
from interest and the only thing that takes the time is the migration code itself. (Remember, I'm making a big simplifying assumption here that most of
the memory is shared among all of the processes of the job, so there is
basically just one object to migrate, so just one system call. The obvious
generalization to the real case is left as an exercise for the
reader. :-) )
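
(In that idealized case the whole migration is a single call against
any one of the pids, since the shared object occupies the same
[va_start, va_end) range in every process:)

    /* one call moves the shared object; every other process that maps
     * it then sees the pages on the new nodes automatically */
    sys_page_migrate(pids[0], va_start, va_end, count,
                     old_node_list, new_node_list);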


Passing in an old_node/new_node mask also solves this part of the
problem, but we have some concerns with that approach as well;
that is a detail for a later email exchange.

--
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
[EMAIL PROTECTED]             [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
         so I installed Linux.
-----------------------------------------------
