Hi Karen, et al.
I've been looking at this in a bit more detail, and I'm still not convinced
that using the multiprocessing module in the Engine is the correct approach to
this - sorry - but I'll explain why...
1) Proxying the DOC
I tried to take some unit tests in the DOC and modify them to use the
multiprocessing module and the Manager mechanism.
I started out trying to simply proxy DataObject, but that failed since
it's an Abstract Base Class, so I would have to proxy a class that could
be instantiated in its own right.
So I created a light wrapper (DataObjectLocalProxy) that wrapped a
DataObject by taking it as a parameter, and calling the proxied object's
methods.
This could be managed by the Manager class... So some progress, but still
not perfect since while it proxied the methods, the returned values
weren't proxies but real objects.
So I then modified the return values of the DataObject methods in
DataObjectLocalProxy so that the returned objects would be proxied
too.
So this made some progress, and I seemed to be getting proxied objects.
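As a rough sketch of what that wrapper looked like (Node here is a minimal stand-in for the real DataObject, and DataObjectLocalProxy is my experimental class, not part of the DOC):

```python
# Sketch of the wrapper approach: DataObjectLocalProxy delegates
# attribute access to the wrapped object, and re-wraps any returned
# object of the same kind so callers never hold a raw reference.
# "Node" is a stand-in for the real DataObject class.

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def get_first_child(self):
        return self.children[0]


class DataObjectLocalProxy:
    def __init__(self, target):
        # Write to __dict__ directly to avoid __getattr__ recursion.
        self.__dict__["_target"] = target

    def __getattr__(self, attr):
        value = getattr(self._target, attr)
        if callable(value):
            def call(*args, **kwargs):
                result = value(*args, **kwargs)
                # Re-wrap returned Nodes so the caller only ever
                # sees proxies, never the underlying objects.
                if isinstance(result, Node):
                    return DataObjectLocalProxy(result)
                return result
            return call
        return value


root = DataObjectLocalProxy(Node("root"))
root.add_child(Node("child"))
first = root.get_first_child()   # comes back as a proxy, not a Node
```

The same re-wrapping would apply to every get_XXXXX() method, which is where the copying cost below comes from.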
Unfortunately I also noticed that the tests were taking considerably
longer to complete:
Not using multi-processing:
Ran 42 tests in 0.027s
Using multi-processing:
Ran 42 tests in 11.611s
So the question is, why is this happening? It seems to be down to the
fact that any time a DataObject is copied over, it's not just that object
that's copied - it's the whole tree it represents!
This is mainly due to the parent/child references...
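A small self-contained example of why this happens (Node is a stand-in with the same parent/child reference pattern, not the real DataObject):

```python
import pickle

# Minimal stand-in for a DataObject with parent/child references,
# showing why pickling any one node serializes the entire tree.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


root = Node("root")
mid = Node("mid", parent=root)
leaf = Node("leaf", parent=mid)

# Pickling just the leaf follows the parent pointer up to the root,
# and from there back down every child list - the whole tree travels.
copy_of_leaf = pickle.loads(pickle.dumps(leaf))
copy_of_root = copy_of_leaf.parent.parent
```

After the round trip, the copied leaf still carries a complete copy of the root and everything under it.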
While it is possible to change how objects are pickled, doing this will
also affect the snapshot mechanism.
So, just adding the UUID mechanism will not suffice; it would be necessary
to replace all parent/child pointers with UUID references, which the DOC
would maintain in a dictionary. The main issue I see here is managing
memory - when can we safely remove a UUID reference and allow garbage
collection to occur?
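A minimal sketch of what that could look like (Registry, Node, and all names here are hypothetical, not existing DOC classes):

```python
import uuid

# Hypothetical sketch: the DOC keeps a registry mapping UUID -> object,
# and objects refer to parent/children by UUID instead of by direct
# pointer, so pickling one object no longer drags in the whole tree.
class Registry:
    def __init__(self):
        self._objects = {}

    def register(self, obj):
        obj.uid = uuid.uuid4()
        self._objects[obj.uid] = obj
        return obj.uid

    def lookup(self, uid):
        return self._objects[uid]

    def release(self, uid):
        # The open question from the text: when is it safe to drop a
        # UUID and let garbage collection reclaim the object?
        del self._objects[uid]


class Node:
    def __init__(self, name):
        self.name = name
        self.parent_uid = None   # UUID reference, not a pointer
        self.child_uids = []


registry = Registry()
root = Node("root")
child = Node("child")
registry.register(root)
registry.register(child)
child.parent_uid = root.uid
root.child_uids.append(child.uid)

# Navigation now goes through the registry:
first_child = registry.lookup(root.child_uids[0])
```

Nothing in a Node points at another Node directly, which is exactly why the memory-management question above becomes the hard part.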
2) Effect on other modules
The introduction of the multiprocessing module in the engine doesn't just
affect the DOC - it also has an effect on most other modules that are
shared services, e.g.:
- Logging
How is writing to the logs co-ordinated when running in
multiple processes?
- Error Service
The error service is a singleton, and it was never expected to be
shared between processes either - yet any layer, even C libraries,
is expected to be able to access it.
- Pause/Resume
This also presents issues, not just in the DOC, but in applications like
DC that wish to support it - a single list of resumable points is no
longer correct, and a dependency tree would need to be defined to
ensure correct behaviour.
Conclusions, and possible alternative solution:
As mentioned before, I'm not convinced that using the multiprocessing
module in the Engine is the correct approach to allowing things to be run
in parallel, for the reasons outlined above.
I totally understand that it's seen as easier to make a change like this
now than it would be in the future once things have been released - but I
also feel that making the change would have a huge impact in terms of
effort, even now.
But I also see that it makes sense to be able to run some things in
parallel - albeit in a limited way.
So a possible solution that I see is the following:
- The engine doesn't use multiprocessing at all, but does allow for
multi-threading of checkpoints, so they can be run in parallel.
By default, checkpoints would be run in sequence, but it would be
possible for a checkpoint to state that it can be run in parallel.
To get around the Global Interpreter Lock (GIL) issue, the checkpoint
itself would handle the running of things in a separate process, and
would yield its thread (and hence the GIL) until that process
terminated. This would also allow the cancel() method to kill
the external process as needed...
This bypasses many of the issues that we have w.r.t. access to the
application address space, since the checkpoint can use other mechanisms
to pass information to the sub-process, and vice versa.
- For this to work, we do need to revisit many of the modules still, but
in a much simpler way, to ensure that they work correctly in a
multi-threaded environment and use the appropriate locking.
- The DOC, and Error Service would need to put locks into
each core object.
- The Engine would need to add support for a parallel flag, and the
running of items in parallel.
I'm not quite sure how to handle snapshot/restore yet; this still
presents an issue w.r.t. when to take a snapshot, and when it's safe
to resume...
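The checkpoint-in-a-thread idea above can be sketched roughly as follows (ParallelCheckpoint, the parallel flag, and the command are all placeholders I made up for illustration, not engine API):

```python
import subprocess
import sys
import threading

# Sketch of the proposed design: the checkpoint's execute() runs in an
# engine thread, hands the real work to an external process, and blocks
# in wait() - which releases the GIL - until that process exits.
# cancel() can then kill the process outright.
class ParallelCheckpoint:
    parallel = True          # hypothetical "may run in parallel" flag

    def __init__(self, command):
        self.command = command
        self._proc = None

    def execute(self):
        self._proc = subprocess.Popen(self.command)
        return self._proc.wait()   # thread blocks here, GIL released

    def cancel(self):
        # Kill the external process if it is still running.
        if self._proc is not None and self._proc.poll() is None:
            self._proc.kill()


# The command stands in for whatever the checkpoint's real work is.
cp = ParallelCheckpoint([sys.executable, "-c", "print('checkpoint work')"])
worker = threading.Thread(target=cp.execute)
worker.start()
worker.join()
```

While one checkpoint's thread is parked in wait(), other checkpoint threads are free to run Python code, which is the limited parallelism I'm suggesting.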
This is my take on the multiprocessing support, and I would like to hear
people's opinions on it.
Thanks,
Darren.
On 10/8/10 12:47 AM, Karen Tung wrote:
> This is a write up of my investigation and recommendation on using
> Python's multiprocessing module to run checkpoints in the engine.
>
> Background:
> -------------
>
> Currently, the engine runs checkpoints sequentially in a separate thread.
> As discussed during the engine's code review, using the multiprocessing
> module will allow us to have better control over the checkpoints. For
> example, the engine can kill checkpoints instead of relying on
> checkpoints to behave correctly when a cancel request is sent to them.
> Using the multiprocessing module also provides the benefit of achieving
> true concurrency if we want to run checkpoints in parallel in the
> future, because it side-steps Python's Global Interpreter Lock (GIL).
>
> In the CUD architecture, checkpoints use the Data Object Cache (DOC) to
> share information with each other. The DOC is currently designed to
> work within the same memory space as the checkpoints. If we were to
> run checkpoints in subprocesses using the multiprocessing module, we
> could no longer use the DOC as currently implemented.
>
>
> Problem:
> ---------
>
> Investigate the feasibility of allowing checkpoints to share DOC data
> with minimal and localized changes when the engine switches to running
> checkpoints in subprocesses in the future.
>
> Possible Solution 1:
> --------------------
>
> 1) Create a customized manager based on
> multiprocessing.managers.BaseManager.
>
> The multiprocessing module provides a manager class that controls a
> server process which holds Python objects and allows subprocesses to
> manipulate them using proxies. The manager provided in the multiprocessing
> module supports various Python objects. To share user-defined objects
> such as the DataObjectCache, we need to create a customized manager and
> register the objects we want to share. Creating a customized manager and
> registering the objects as proxies is relatively straightforward.
> For example:
>
> -------------
> class MyManager(BaseManager):
>     pass
>
> MyManager.register("doc", DataObjectCache)
> -------------
>
> To start the manager and create a proxy object:
>
> ------------------
> manager = MyManager()
> manager.start()
> my_doc = manager.doc()
> ------------------
>
> Now, the application/checkpoints/engine can call all the public
> methods of the "my_doc" object as usual.
> If you want to call private methods or properties on
> the DataObjectCache object, you can easily create a customized proxy.
>
> The major problem with this approach is that values returned from
> the DataObject.get_XXXXX() functions, such as get_first_child(), are
> automatically converted to plain objects in the calling process's
> space. Therefore, if you first do a doc.get_first_child() to retrieve
> an object, and then want to add a child to that object, the operation
> will not go through the proxy version of the object, because the object
> is already in the process's space. So, you cannot build the DOC
> structure we have today by operating on the DataObjects like we do now.
>
>
> Reference links on the shared memory manager provided by the
> multiprocessing module:
>
> *
> http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
>
>
> * http://docs.python.org/library/multiprocessing.html#customized-managers
>
> Possible Solution 2:
> ---------------------
>
> Create a separate server similar to "ManifestServ" in old DC for managing
> the DOC data.
>
> - This works OK for dealing with simple data, such as strings and
> numbers, in the DOC. It does not work well for returning complex
> data; additional code is needed to marshal and unmarshal complex data.
>
> - There will be challenges for inserting data, such as:
> * synchronizing access, so that checkpoints don't overwrite each
> other's data.
> * If a checkpoint specifies a path for insertion that does not
> already exist, how do we deal with that?
> * If it specifies a path that does not lead to a unique node for
> insertion, how do we deal with that?
>
> - There's also the overhead of having to manage this server: making
> sure that it is started up correctly, passing a reference to it
> around, and cleaning it up completely when it is no longer needed.
>
> - All code that currently uses the DOC will need to be updated to use
> this server.
>
> Possible Solution 3:
> ---------------------
>
> As suggested by Darren, the engine can pickle the DOC and pass it
> as an argument to the checkpoint. Checkpoints will return a copy of the
> updated DOC to the engine after it completes execution, and the engine
> can update DOC by calling doc.load_from_snapshot().
>
> If multiple processes run in parallel and they all make modifications
> to the DOC, we will need extra code to merge the differences. If
> checkpoints make changes in similar areas of the DOC, merging
> might or might not be possible.
>
> My recommendation:
> ---------------------
>
> Possible solution 1 seems to involve the least amount of
> work, since it utilizes the shared manager feature of
> the multiprocessing module.
>
> To overcome the problem of not being able to work on the proxy version
> of DOC objects, I suggest adding a UUID to each object stored
> in the DataObjectCache when the objects are inserted. This way, when
> we want to operate (search, insert_child, modify properties,
> delete) on the objects, they can be referred to by their UUIDs.
> In addition, instead of operating on the individual
> DataObjects like we do now, we issue all requests to the
> DataObjectCache proxy object that's
> created by the engine. The DataObjectCache class will be enhanced
> to locate the object by UUID and perform the requested operation.
>
> Proposed changes:
>
> 1) In the engine's __init__(), create the shared memory manager and
> create the doc as a proxy to DataObjectCache.
> 2) Update the implementation of DataObjectCache/DataObject to add UUID
> to objects as discussed above.
> 3) Update the implementation of DataObjectCache to be able to search
> objects by
> the provided UUID and perform the appropriate action on them.
> 4) Update the implementation of the DataObject such that
> when you call a function that acts on the object, e.g.
> DataObject.insert_children(),
> that function will first get a reference to the engine, then
> look up the doc_proxy that's stored in the engine, and call
> engine.doc_proxy.insert_children(UUID, object_to_insert).
>
> I believe these changes will allow the support for running
> checkpoints in subprocesses to be localized to the engine and DOC only.
>
> Thanks,
>
> --Karen
>
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss