Hi Karen, et al.
I've been looking at this in a bit more detail, and I'm still not convinced
that using the multiprocessing module in the Engine is the correct approach to
this - sorry - but I'll explain why...
1) Proxying the DOC
I tried to take some unit tests in the DOC and modify them to use the
multiprocessing module and the Manager mechanism.
I started out trying to simply proxy DataObject, but that failed since
it's an Abstract Base Class, so I would have to proxy a class that could
be instantiated in its own right.
So I created a light wrapper (DataObjectLocalProxy) that wrapped a
DataObject by taking it as a parameter, and calling the proxied object's
methods.
This could be managed by the Manager class... So some progress, but still
not perfect since while it proxied the methods, the returned values
weren't proxies but real objects.
So I then modified the return values of the DataObject methods in
DataObjectLocalProxy so that the returned objects would be proxied
too.
So this made some progress, and I seemed to be getting proxied objects.
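As a rough sketch of what that wrapper looked like (Node here is a minimal stand-in for the real DataObject, and DataObjectLocalProxy is my experimental class, not part of the DOC):

```python
# Sketch of the wrapper approach: DataObjectLocalProxy delegates
# attribute access to the wrapped object, and re-wraps any returned
# object of the same kind so callers never hold a raw reference.
# "Node" is a stand-in for the real DataObject class.

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def get_first_child(self):
        return self.children[0]


class DataObjectLocalProxy:
    def __init__(self, target):
        # Write to __dict__ directly to avoid __getattr__ recursion.
        self.__dict__["_target"] = target

    def __getattr__(self, attr):
        value = getattr(self._target, attr)
        if callable(value):
            def call(*args, **kwargs):
                result = value(*args, **kwargs)
                # Re-wrap returned Nodes so the caller only ever
                # sees proxies, never the underlying objects.
                if isinstance(result, Node):
                    return DataObjectLocalProxy(result)
                return result
            return call
        return value


root = DataObjectLocalProxy(Node("root"))
root.add_child(Node("child"))
first = root.get_first_child()   # comes back as a proxy, not a Node
```

The same re-wrapping would apply to every get_XXXXX() method, which is where the copying cost below comes from.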
Unfortunately I also noticed that the tests were taking considerably
longer to complete:
Not using multi-processing:
Ran 42 tests in 0.027s
Using multi-processing:
Ran 42 tests in 11.611s
So the question is, why is this happening? It seems to be down to the
fact that any time a DataObject is copied over, it's not just that object
that's copied - it's the whole tree it represents!
This is mainly due to the parent/child references...
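A small self-contained example of why this happens (Node is a stand-in with the same parent/child reference pattern, not the real DataObject):

```python
import pickle

# Minimal stand-in for a DataObject with parent/child references,
# showing why pickling any one node serializes the entire tree.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


root = Node("root")
mid = Node("mid", parent=root)
leaf = Node("leaf", parent=mid)

# Pickling just the leaf follows the parent pointer up to the root,
# and from there back down every child list - the whole tree travels.
copy_of_leaf = pickle.loads(pickle.dumps(leaf))
copy_of_root = copy_of_leaf.parent.parent
```

After the round trip, the copied leaf still carries a complete copy of the root and everything under it.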
While it is possible to change how objects are pickled, doing this will
also affect the snapshot mechanism.
So, just adding the UUID mechanism will not suffice; it would be necessary
to replace all parent/child pointers with UUID references, which the DOC
would maintain in a dictionary. The main issue I see here is managing
memory - when can we safely remove a UUID reference and allow garbage
collection to occur?
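A minimal sketch of what that could look like (Registry, Node, and all names here are hypothetical, not existing DOC classes):

```python
import uuid

# Hypothetical sketch: the DOC keeps a registry mapping UUID -> object,
# and objects refer to parent/children by UUID instead of by direct
# pointer, so pickling one object no longer drags in the whole tree.
class Registry:
    def __init__(self):
        self._objects = {}

    def register(self, obj):
        obj.uid = uuid.uuid4()
        self._objects[obj.uid] = obj
        return obj.uid

    def lookup(self, uid):
        return self._objects[uid]

    def release(self, uid):
        # The open question from the text: when is it safe to drop a
        # UUID and let garbage collection reclaim the object?
        del self._objects[uid]


class Node:
    def __init__(self, name):
        self.name = name
        self.parent_uid = None   # UUID reference, not a pointer
        self.child_uids = []


registry = Registry()
root = Node("root")
child = Node("child")
registry.register(root)
registry.register(child)
child.parent_uid = root.uid
root.child_uids.append(child.uid)

# Navigation now goes through the registry:
first_child = registry.lookup(root.child_uids[0])
```

Nothing in a Node points at another Node directly, which is exactly why the memory-management question above becomes the hard part.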
2) Effect on other modules
The introduction of the multiprocessing module in the engine doesn't just
affect the DOC - it also has an effect on most other modules that are
shared services, e.g.:
- Logging
How is writing to the logs co-ordinated when running in
multiple processes?
- Error Service
The error service is a singleton, and it was never expected to be
shared between processes either - yet any layer, even C libraries,
is expected to be able to access it.
- Pause/Resume
This also presents issues, not just in the DOC, but in applications like
DC that wish to support it - a single list of resumable points is no
longer correct, and a dependency tree would need to be defined to
ensure correct behaviour.
Conclusions, and possible alternative solution:
As mentioned before, I'm not convinced that using the multiprocessing
module in the Engine is the correct approach to allowing things to be run
in parallel, for the reasons outlined above.
I totally understand that it's seen as easier to make a change like this
now than it would be in the future once things have been released - but I
also feel that making the change would have a huge impact in terms of
effort, even now.
But I also see that it makes sense to be able to run some things in
parallel - albeit in a limited way.
So a possible solution that I see is the following:
- The engine doesn't use multiprocessing at all, but does allow for
multi-threading of checkpoints, so they can be run in parallel.
By default, checkpoints would be run in sequence, but it would be
possible for a checkpoint to state that it can be run in parallel.
To get around the Global Interpreter Lock (GIL) issue, the checkpoint
itself would handle the running of things in a separate process, and
would yield its thread (and hence the GIL) until that process
terminated. This would also allow the cancel() method to kill
the external process as needed...
This bypasses many of the issues that we have w.r.t. access to the
application address space, since the checkpoint can use other mechanisms
to pass information to the sub-process, and vice versa.
- For this to work, we do need to revisit many of the modules still, but
in a much simpler way, to ensure that they work correctly in a
multi-threaded environment and use the appropriate locking.
- The DOC, and Error Service would need to put locks into
each core object.
- The Engine would need to add support for a parallel flag, and the
running of items in parallel.
I'm not quite sure how to handle snapshot/restore yet; this still
presents an issue w.r.t. when to take a snapshot, and when it's safe
to resume...
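The checkpoint-in-a-thread idea above can be sketched roughly as follows (ParallelCheckpoint, the parallel flag, and the command are all placeholders I made up for illustration, not engine API):

```python
import subprocess
import sys
import threading

# Sketch of the proposed design: the checkpoint's execute() runs in an
# engine thread, hands the real work to an external process, and blocks
# in wait() - which releases the GIL - until that process exits.
# cancel() can then kill the process outright.
class ParallelCheckpoint:
    parallel = True          # hypothetical "may run in parallel" flag

    def __init__(self, command):
        self.command = command
        self._proc = None

    def execute(self):
        self._proc = subprocess.Popen(self.command)
        return self._proc.wait()   # thread blocks here, GIL released

    def cancel(self):
        # Kill the external process if it is still running.
        if self._proc is not None and self._proc.poll() is None:
            self._proc.kill()


# The command stands in for whatever the checkpoint's real work is.
cp = ParallelCheckpoint([sys.executable, "-c", "print('checkpoint work')"])
worker = threading.Thread(target=cp.execute)
worker.start()
worker.join()
```

While one checkpoint's thread is parked in wait(), other checkpoint threads are free to run Python code, which is the limited parallelism I'm suggesting.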
This is my take on the multiprocessing support, and I would like to hear
people's opinions on it.
Thanks,
Darren.
On 10/8/10 12:47 AM, Karen Tung wrote:
> This is a write up of my investigation and recommendation on using
> Python's multiprocessing module to run checkpoints in the engine.
>
> Background:
> -------------
>
> Currently, the engine runs checkpoints sequentially in a separate thread.
> As discussed during the engine's code review, using the multiprocessing
> module will allow us to have better control over the checkpoints. For
> example, the engine can kill checkpoints instead of relying on
> checkpoints to behave correctly when a cancel request is sent to them.
> Using the multiprocessing module also provides the benefit of achieving
> true concurrency if we want to run checkpoints in parallel in the
> future, because it side-steps Python's Global Interpreter Lock (GIL).
>
> In the CUD architecture, checkpoints use the Data Object Cache (DOC) to
> share information with each other. The DOC is currently designed to
> work within the same memory space as the checkpoints. If we were to
> run checkpoints in subprocesses using the multiprocessing module, we
> could no longer use the DOC as currently implemented.
>
>
> Problem:
> ---------
>
> Investigate the feasibility of allowing checkpoints to share DOC data
> with minimal and localized changes when the engine switches to running
> checkpoints in subprocesses in the future.
>
> Possible Solution 1:
> --------------------
>
> 1) Create a customized manager based on
> multiprocessing.managers.BaseManager.
>
> The multiprocessing module provides a manager class that controls a
> server process which holds Python objects and allows subprocesses to
> manipulate them using proxies. The manager provided in the multiprocessing
> module supports various Python objects. To share user-defined objects
> such as the DataObjectCache, we need to create a customized manager and
> register the objects we want to share. Creating a customized manager and
> registering the objects as proxies is relatively straightforward.
> For example:
>
> -------------
> class MyManager(BaseManager):
>     pass
>
> MyManager.register("doc", DataObjectCache)
> -------------
>
> To start the manager and create a proxy object:
>
> ------------------
> manager = MyManager()
> manager.start()
> my_doc = manager.doc()
> ------------------
>
> Now, the application/checkpoints/engine can call all the public
> methods of the "my_doc" object as usual.
> If you want to call private methods or properties on
> the DataObjectCache object, you can easily create a customized proxy.
>
> The major problem with this approach is that values returned from
> the DataObject.get_XXXXX() functions, such as get_first_child(), are
> automatically converted to plain objects in the calling process's
> space. Therefore, if you first do a doc.get_first_child() to retrieve
> an object, and then want to add a child to that object, the operation
> will not go through the proxy version of the object, because the object
> is already in the process's space. So, you cannot build the DOC
> structure we have today by operating on the DataObjects like we do now.
>
>
> Reference links on the shared memory manager provided by the
> multiprocessing module:
>
> *
> http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
>
>
> * http://docs.python.org/library/multiprocessing.html#customized-managers
>
> Possible Solution 2:
> ---------------------
>
> Create a separate server similar to "ManifestServ" in old DC for managing
> the DOC data.
>
> - This works OK for dealing with simple data, such as strings and
> numbers, in the DOC. It does not work well for returning complex
> data; additional code is needed to marshal and unmarshal complex data.
>
> - There will be challenges for inserting data, such as:
> * synchronizing access, so that checkpoints don't overwrite each
> other's data.
> * If a checkpoint specifies a path for insertion that does not
> already exist, how do we deal with that?
> * If it specifies a path that does not lead to a unique node for
> insertion, how do we deal with that?
>
> - There's also the overhead of having to manage this server: making
> sure that it is started up correctly, passing a reference to it
> around, and cleaning it up completely when it is no longer needed.
>
> - All code that currently uses the DOC will need to be updated to use
> this server.
>
> Possible Solution 3:
> ---------------------
>
> As suggested by Darren, the engine can pickle the DOC and pass it
> as an argument to the checkpoint. Checkpoints will return a copy of the
> updated DOC to the engine after it completes execution, and the engine
> can update DOC by calling doc.load_from_snapshot().
>
> If multiple processes run in parallel and they all make modifications
> to the DOC, we will need extra code to merge the differences. If
> checkpoints make changes in similar areas of the DOC, merging
> might or might not be possible.
>
> My recommendation:
> ---------------------
>
> Possible solution 1 seems to involve the least amount of
> work, since it utilizes the shared manager feature of
> the multiprocessing module.
>
> To overcome the problem of not being able to work on the proxy version
> of DOC objects, I suggest adding a UUID to each object stored
> in the DataObjectCache when the objects are inserted. This way, when
> we want to operate (search, insert_child, modify properties,
> delete) on the objects, they can be referred to by their UUIDs.
> In addition, instead of operating on the individual
> DataObjects like we do now, we issue all requests to the
> DataObjectCache proxy object that's
> created by the engine. The DataObjectCache class will be enhanced
> to locate the object by UUID and perform the requested operation.
>
> Proposed changes:
>
> 1) In the engine's __init__(), create the shared memory manager and
> create the doc as a proxy to DataObjectCache.
> 2) Update the implementation of DataObjectCache/DataObject to add UUID
> to objects as discussed above.
> 3) Update the implementation of DataObjectCache to be able to search
> objects by
> the provided UUID and perform the appropriate action on them.
> 4) Update the implementation of the DataObject such that
> when you call a function that acts on the object, e.g.
> DataObject.insert_children(),
> that function will first get a reference to the engine, then
> look up the doc_proxy that's stored in the engine, and call
> engine.doc_proxy.insert_children(UUID, object_to_insert).
>
> I believe these changes will allow the support for running
> checkpoints in subprocesses to be localized to the engine and DOC only.
>
> Thanks,
>
> --Karen
>
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss