This is a write-up of my investigation and recommendation on using
Python's multiprocessing module to run checkpoints in the engine.

Background:
-------------
Currently, the engine runs checkpoints sequentially in a separate thread.
As discussed during the engine's code review, using the multiprocessing
module will give us better control over the checkpoints. For example,
the engine can kill checkpoints instead of relying on them to behave
correctly when a cancel request is sent. Using the multiprocessing
module also provides the benefit of achieving true concurrency if we
want to run checkpoints in parallel in the future, because it
side-steps Python's Global Interpreter Lock (GIL).

In the CUD architecture, checkpoints use the Data Object Cache (DOC) to
share information with each other. The DOC is currently designed to
work within the same memory space as the checkpoints. If we were to
run checkpoints in subprocesses using the multiprocessing module, we
could no longer use the DOC as currently implemented.

Problem:
---------
Investigate the feasibility of allowing checkpoints to share DOC data
with minimal and localized changes when the engine switches to running
checkpoints in subprocesses in the future.

Possible Solution 1:
--------------------
Create a customized manager based on multiprocessing.managers.BaseManager.

The multiprocessing module provides a manager class that controls a
server process which holds Python objects and allows subprocesses to
manipulate them through proxies. The managers provided in the
multiprocessing module support various built-in Python objects. To
share user-defined objects such as the DataObjectCache, we need to
create a customized manager and register the objects we want to share.
Creating a customized manager and registering the objects as proxies is
relatively straightforward.
For example:
-------------
from multiprocessing.managers import BaseManager

class MyManager(BaseManager):
    pass

MyManager.register("doc", DataObjectCache)
-------------
To start the manager and create a proxy object:
------------------
manager = MyManager()
manager.start()
my_doc = manager.doc()
------------------
Now, the application, checkpoints, and engine can call all the public
methods of the "my_doc" object as usual. If you want to call private
methods or properties on the DataObjectCache object, you can easily
create a customized proxy.
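For example, a customized proxy might look like the sketch below. The
_exposed_ list and _callmethod() come from
multiprocessing.managers.BaseProxy; the method names are only
illustrations of what DataObjectCache might expose:
------------------
from multiprocessing.managers import BaseManager, BaseProxy

class DocProxy(BaseProxy):
    # Only the names listed here are callable through the proxy, so
    # private methods can be exposed explicitly.
    _exposed_ = ("get_first_child", "_private_helper")

    def get_first_child(self):
        return self._callmethod("get_first_child")

    def _private_helper(self):
        # '_private_helper' is a hypothetical private method.
        return self._callmethod("_private_helper")

MyManager.register("doc", DataObjectCache, proxytype=DocProxy)
------------------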
The major problem with this approach is that values returned from the
DataObject.get_XXXXX() functions, such as get_first_child(), are
automatically converted into plain objects in the calling process's
space. Therefore, if you first do a doc.get_first_child() to retrieve
an object, and then you want to add a child to that object, the call
will not go through the proxy version of the object, because the
object is already a local copy in the process's space. So, you cannot
build the DOC structure we have today by operating on the DataObjects
like we do now.
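A short illustration of the problem, where new_object stands in for
some DataObject a checkpoint wants to insert:
------------------
# 'child' is a local copy unpickled into this process, not a proxy.
child = my_doc.get_first_child()

# This mutates only the local copy; the DOC held in the manager's
# server process never sees the new child.
child.insert_children(new_object)
------------------
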
Reference links on the shared-memory managers provided by the
multiprocessing module:
* http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
* http://docs.python.org/library/multiprocessing.html#customized-managers

Possible Solution 2:
---------------------
Create a separate server similar to "ManifestServ" in the old DC for
managing the DOC data.
- This works OK for dealing with simple data, such as strings and
  numbers, in the DOC. It does not work well for returning complex
  data; additional code is needed to marshal and unmarshal complex
  data (see the sketch after this list).
- There will be challenges for inserting data, such as:
  * synchronizing access, so checkpoints don't overwrite each other's
    data.
  * If a checkpoint specifies a path for insertion that does not
    already exist, how do we deal with that?
  * If a checkpoint specifies a path that does not lead to a unique
    node for insertion, how do we deal with that?
- There's also the overhead of having to manage this server: making
  sure that it is started up correctly, passing a reference to it
  around, and cleaning it up completely when it is no longer needed.
- All code that currently uses the DOC will need to be updated to use
  this server.
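As a rough idea of the marshalling burden, here is a minimal sketch of
such a server using multiprocessing.connection, which pickles objects
across the connection; the request format and the doc.lookup() call
are purely illustrative:
------------------
from multiprocessing.connection import Listener

# 'doc' is assumed to be the server-side DataObjectCache instance.
listener = Listener(("localhost", 6000), authkey=b"doc")
conn = listener.accept()
while True:
    # recv()/send() pickle and unpickle automatically, but every
    # request and reply still has to be reduced to something picklable.
    request = conn.recv()          # e.g. ("get", "/disk/partition")
    if request[0] == "get":
        conn.send(doc.lookup(request[1]))
    elif request[0] == "quit":
        break
conn.close()
listener.close()
------------------
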
Possible Solution 3:
---------------------
As suggested by Darren, the engine can pickle the DOC and pass it as
an argument to the checkpoint. The checkpoint will return a copy of
the updated DOC to the engine after it completes execution, and the
engine can update the DOC by calling doc.load_from_snapshot().

If multiple checkpoint processes run in parallel and they all make
modifications to the DOC, we will need extra code to merge the
differences. If checkpoints make changes in similar areas of the DOC,
merging may or may not be possible.
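A minimal sketch of this flow, assuming a hypothetical
checkpoint.execute(doc) entry point and glossing over
load_from_snapshot()'s real signature:
------------------
import pickle
from multiprocessing import Process, Queue

def run_checkpoint(checkpoint, doc_snapshot, results):
    doc = pickle.loads(doc_snapshot)   # checkpoint gets its own copy
    checkpoint.execute(doc)            # hypothetical entry point
    results.put(pickle.dumps(doc))     # hand the updated copy back

results = Queue()
worker = Process(target=run_checkpoint,
                 args=(checkpoint, pickle.dumps(doc), results))
worker.start()
updated_snapshot = results.get()
worker.join()
doc.load_from_snapshot(updated_snapshot)   # signature assumed here
------------------
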
My recommendation:
---------------------
Possible solution 1 seems to involve the least amount of work, since
it utilizes the shared-memory manager feature of the multiprocessing
module.

To overcome the problem of not being able to work on the proxy version
of DOC objects, I suggest adding a UUID to each object stored in the
DataObjectCache when the object is inserted. This way, when we want to
operate on the objects (search, insert_child, modify properties,
delete), they can be referred to by their UUIDs. In addition, instead
of operating on the individual DataObjects like we do now, we issue
all requests to the DataObjectCache proxy object that's created by the
engine. The DataObjectCache class will be enhanced to locate the
object by UUID and perform the requested operation.
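From a checkpoint's point of view, usage might look like this sketch;
find_path() is a hypothetical lookup helper, and the proxy method
signatures are assumptions:
------------------
# All operations go through the single DOC proxy the engine created;
# objects are named by UUID instead of being touched as local copies.
target_uuid = my_doc.find_path("/disk/partition")   # hypothetical lookup
my_doc.insert_children(target_uuid, new_object)     # mutates server-side
my_doc.delete(target_uuid)                          # likewise by UUID
------------------
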
Proposed changes:
1) In the engine's __init__(), create the shared-memory manager and
   create the doc as a proxy to DataObjectCache.
2) Update the implementation of DataObjectCache/DataObject to add a
   UUID to objects as discussed above.
3) Update the implementation of DataObjectCache to be able to search
   for objects by the provided UUID and perform the appropriate action
   on them.
4) Update the implementation of DataObject such that when you call a
   function that acts on the object, e.g. DataObject.insert_children(),
   that function will first get a reference to the engine, then look
   up the doc_proxy that's stored in the engine, and call
   engine.doc_proxy.insert_children(UUID, object_to_insert) (see the
   sketch below).
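A rough sketch of change 4, where get_engine() stands in for whatever
mechanism actually locates the engine; it is an assumption, not an
existing API:
------------------
import uuid

class DataObject(object):
    def __init__(self):
        # Change 2: each object carries a UUID (assigned here for
        # simplicity; the proposal assigns it at insertion time).
        self.uuid = uuid.uuid4()

    def insert_children(self, objects):
        engine = get_engine()   # hypothetical engine accessor
        # Delegate to the shared proxy so the mutation happens in the
        # manager's server process instead of on a local copy.
        engine.doc_proxy.insert_children(self.uuid, objects)
------------------
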
I believe these changes will allow the modifications needed to support
running checkpoints in subprocesses to be localized to the engine and
the DOC only.
Thanks,
--Karen