On Wednesday, September 18, 2013 9:07:12 AM UTC-5, Andy Parker wrote:
>
> On Tue, Sep 17, 2013 at 1:19 PM, John Bollinger <john.bo...@stjude.org> wrote:
>>
>> On Monday, September 16, 2013 11:12:58 AM UTC-5, Andy Parker wrote:
>>>
>>> On Mon, Sep 16, 2013 at 6:56 AM, John Bollinger <john.bo...@stjude.org> wrote:
>>>>
>>>> On Monday, September 16, 2013 6:48:17 AM UTC-5, Andy Parker wrote:
>>>>>
>>>>> The problem with this picture for being able to batch operations
>>>>> together is that everything turns into calls on the Puppet::Type
>>>>> instance, which then makes all of the individual calls to the
>>>>> provider. To batch, we need to group resources together and then
>>>>> give the groups to the provider.
>>>>
>>>> So you don't think my suggestion above to let providers assemble and
>>>> apply batches is workable? I think it requires only one or two extra
>>>> signals to the provider (via the Type) to mark batch boundaries. Most
>>>> of the magic would happen in those providers that choose to perform
>>>> it, and those that don't can just ignore the new signals. The main
>>>> part of the protocol between the agent core and types/providers
>>>> remains unchanged.
>>>>
>>>> I haven't delved into the details of how exactly it would be
>>>> implemented, so perhaps there is a show-stopper there, but I'm not
>>>> seeing a flaw in the basic idea.
>>>
>>> No, you are right. I forgot about that one. I was just running through
>>> the code; the biggest problem that I can see so far is simply that
>>> there isn't "the provider". We end up with a provider instance per
>>> resource, as far as I can tell. Others have solved that by tracking
>>> data on the provider class.
>>>
>>> I think that for the batching we just need a way of asking a provider
>>> if two resources (for the same provider class) are batchable.
>>> The comparison of batchable needs to be transitive (so if A and B are
>>> batchable, and so are B and C, then all of A, B, and C are). In fact
>>> it needs to also be symmetric and reflexive, since it is really just
>>> another form of equality. That helps us to define what can be batched
>>> together.
>>
>> I think an equivalence relation may be stronger than is needed. It
>> should be sufficient to be able to answer this weaker question: given
>> a set S of mutually batchable resources and a resource R not in S, can
>> R be batched together with all the resources of S? It is possible for
>> a provider type to be able to batch resources on that basis, but not
>> meaningfully to batch resources based on a full equivalence relation.
>
> The difference just comes down to how conservative the approach should
> be. The equivalence relation leaves out the possibility of a provider
> being able to say that, of a set of three resources A, B, C, it can
> batch [A, B] or [B, C], but not [A, B, C] (because of the transitive
> constraint). The way that you state it would allow it to do that.
Yes.

> I actually think that leaving the batching that open would lead to
> unwanted variations of batches between runs.

How so? Puppet is deterministic, is it not? Of course, if manifests
change then the batching may change, but that's a possibility no matter
how batching is implemented. As a practical matter, I think the
variations arising from a more open formulation are far more likely to
involve packages that are batched incidentally -- e.g. packages declared
by different classes, maybe -- than to involve packages where batching
provides a desired functional benefit. I think the implementation could
ensure that, if desired.

> By expressing the batches in the form of a comparison operator we can
> guarantee consistency (unless the implementation of the comparison does
> not conform to the requirements).

I don't think you get any more consistency this way. You may get
different batches than you do with the other approach, but both are
sensitive to the same variations in the sequence in which resources are
proposed for batching.

> There is actually also a run-time problem. The comparison operator
> allows batches to be built in O(n) time, where n is the number of
> resources to consider, whereas, in general, the set-inclusion method
> would be O(n^2).

An implementation that must compare each offered resource to all the
others already in the set would indeed consume O(n^2) time, but such
implementations are by no means the only possibilities. In particular,
implementations based on equivalence classes are an alternative that
providers could choose to employ, and we agree that such an
implementation would require only O(n) time. Moreover, there are
countless potential implementations specific to particular resource
types or provider types that have suitable characteristics.
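To make the O(n) claim concrete, here is a quick Ruby sketch of the
equivalence-class approach. Nothing in it is existing Puppet API: the
`batch_key` method and the ensure-based keying (separating removals from
installs, as a yum-like provider might) are illustrative assumptions.

```ruby
# Sketch: O(n) batch formation via equivalence classes. The batch_key
# method is a hypothetical provider API: it must return equal values
# for exactly those resources that are mutually batchable.

Resource = Struct.new(:name, :ensure_value)

# Illustrative key for a yum-like provider: removals and installs go
# into separate classes, so the two are never mixed in one batch.
def batch_key(resource)
  resource.ensure_value == :absent ? :removals : :installs
end

# One pass over n resources; Hash-backed group_by keeps this O(n).
def build_batches(resources)
  resources.group_by { |r| batch_key(r) }.values
end

resources = [
  Resource.new('vim',   :installed),
  Resource.new('emacs', :absent),
  Resource.new('git',   :latest)
]

batches = build_batches(resources)
batches.map { |b| b.map(&:name) }  # => [["vim", "git"], ["emacs"]]
```

The more flexible set-inclusion predicate could be layered on top of the
same loop; only the grouping test would change.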
> I think having the batching criteria a bit more conservative ends up
> being a win, because of ease of definition and speed of execution, even
> if it will miss some cases where it could have created a batch.

I think defining the interface in such a way that it allows more
flexibility ends up being a win, because it allows providers more
freedom to customize behavior as appropriate for their underlying
implementation. Inasmuch as the same equivalence-class approach you
propose can be implemented under the more flexible scheme, there can be
no *inherent* speed penalty in allowing greater flexibility. The
interface definition and sequence of operations can be at least as
simple in the more flexible case, as I will show, and it affords the
opportunity for bigger batches, which is also a performance advantage.
Indeed, inasmuch as running external commands is very expensive, I think
you have to get to very large batch sizes indeed before even an O(n^2)
batching algorithm loses, despite reducing the number of batches.

> I agree: the provider type specifies what can be in a batch (via the
> comparison discussed above), and the core decides what candidates to
> try to batch.
>
>> Consider, for example, the "yum" Package provider. Because of yum's
>> nature, the provider cannot easily support batching out-of-sync
>> packages ensured 'absent' with out-of-sync packages ensured
>> 'installed', 'latest', or <version>, but as long as external
>> considerations (e.g. relationships with other resources) do not
>> preclude it, the yum provider could simultaneously build separate
>> batches for the two categories. That would allow for larger batches
>> to be formed under some circumstances, and it could be essential for
>> correct operation of removals in others.
>> More generally, batching under control of providers would allow
>> batches of different provider types and even of different resource
>> types to be formed simultaneously, provided always that the
>> application order of the relevant resources is not constrained.
>
> Are you thinking that a provider should somehow batch together
> resources of different type? That seems like a very accident-prone
> thing to do.

No, I am thinking that multiple distinct providers could have batch
assemblies in-flight simultaneously. Also, some individual providers
might have multiple in-flight at the same time. It might work like this:

1. Initially there are zero batches.
2. The transaction chooses a "next" resource by any mechanism of its
   choice.
3. The transaction determines whether any of the extrinsic criteria it
   can evaluate itself prevent the resource from being included in a
   batch with previous ones. Some formulations of this test could be
   very cheap.
4. If extrinsic considerations prevent batching, then the transaction
   sends a "flush" message that causes all in-flight batches to be
   applied by the provider classes maintaining them.
5. The transaction dispatches the resource for application, using the
   existing interface that already serves that purpose.
6. The provider for the resource chooses whether to apply it immediately
   or whether to add it to an existing or new batch. It may choose at
   this time to apply one or more entire batches that it had had
   in-flight.
7. If there are more resources, return to step 2.
8. The transaction sends a "flush" message, as in step 4, to ensure that
   the last batches are applied.

Note 1: Where I say "transaction" above, it might be more appropriate to
say "resource graph" or to name some other component. I'm still not
focusing on implementation details, though you should feel free to raise
potential implementation issues if you see any.
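The eight steps above can be sketched as a dispatch loop. To be clear,
every class and method name below (`Transaction`, `apply`,
`register_batch_listener`, `flush_batches`, `extrinsically_batchable?`)
is a hypothetical name for illustration, not existing Puppet internals:

```ruby
# Sketch of the eight-step dispatch protocol described above. All class
# and method names here are hypothetical, not existing Puppet internals.

Resource = Struct.new(:name, :provider_class)

class Transaction
  def initialize(resources)
    @resources = resources
    @batch_listeners = []            # providers with in-flight batches
  end

  # Steps 4 and 8: tell every subscribed provider to apply its batches.
  def flush_batches
    @batch_listeners.each(&:flush_batches)
    @batch_listeners.clear
  end

  def register_batch_listener(provider_class)
    @batch_listeners << provider_class unless @batch_listeners.include?(provider_class)
  end

  # Step 3: extrinsic criteria the transaction can evaluate itself,
  # e.g. resource relationships. Trivially permissive in this sketch.
  def extrinsically_batchable?(_resource)
    true
  end

  def run
    @resources.each do |resource|                    # step 2: pick next
      flush_batches unless extrinsically_batchable?(resource)  # step 4
      resource.provider_class.apply(resource, self)  # step 5: dispatch
    end
    flush_batches                                    # step 8: final flush
  end
end

# Step 6: a provider that batches instead of applying immediately.
class BatchingProvider
  @batch   = []
  @applied = []                      # record of applied batches
  class << self
    attr_reader :applied

    def apply(resource, transaction)
      @batch << resource
      transaction.register_batch_listener(self)
    end

    def flush_batches
      @applied << @batch.map(&:name) unless @batch.empty?
      @batch = []
    end
  end
end

Transaction.new([Resource.new('a', BatchingProvider),
                 Resource.new('b', BatchingProvider)]).run
# BatchingProvider.applied is now [["a", "b"]]: one batch of two resources
```

The existing `apply` dispatch (step 5) is the only point of contact, so a
provider that ignores batching entirely just applies each resource
inside its own `apply` and never subscribes to flushes.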
Note 2: By "extrinsic" criteria I mean criteria that can be evaluated
without consulting providers. Prime among these are resource
relationships. One way this could be cheap is for the transaction to
work on resources in blocks: initially, all those that are ready to be
applied immediately relative to explicit and automatic relationships.
All resources in such a block are considered mutually batchable in the
extrinsic sense, but not batchable with anything else. After each block
is dispatched, the next block of all resources then ready to be applied
immediately is determined.

Note 3: The ability to target "flush" messages appropriately at steps
(4) and (8) may require a mechanism whereby providers can subscribe to
such messages when they batch a resource instead of applying it
immediately.

Note 4: If you don't like the idea of multiple providers assembling
batches simultaneously, then you can include a restriction against that
among the extrinsic considerations that may prevent batching. In that
case, it might be advantageous to use an implementation of step (2) that
attempts to group resources by provider.

> One thing that I'm still uncertain about is how to handle failures,
> and what should appear in the report. Should all resources get the
> same event? Should they all fail or succeed together? Is that
> something that the provider gets to decide? I think the provider
> should be able to decide it, so that if it is able to separate the
> parts of the batch, then it can report on that, and if it isn't, then
> it just gives them all the same status.

Very good questions. Regarding eventing: the change is least intrusive
if each resource continues to publish its own events, but it's not clear
to me whether there is any practical, observable difference. Regarding
failing or succeeding together: are you talking only about reporting and
logging, or also about strategies for handling failures?
For example, a provider might handle a failed batch by attempting to
apply the member resources individually. I do think that when resources
are applied in a batch, it makes sense for the log and report to reflect
that. If a batch fails but a fallback strategy partially or wholly
succeeds, then at least the log should reflect all of the above.
Overall, however, I'm all in favor of allowing providers as much freedom
as feasible.

> Should the report contain information about what the batch was? What
> else might the report contain?

I think non-trivial batches should be reported. I don't immediately know
what use the information might be put to, but I think it reasonable to
expect that all the creative people in the community will find uses.

> Based on this, I think that at a minimum, the new interface for
> providers is:
>
>   * Provider::batchable?(resource1, resource2)
>   * Provider::batch_start
>   * Provider::batch_end

I think the minimum new interface for providers is just

  Provider::flush_batches (class method)

If multiple providers are allowed to assemble batches simultaneously,
then there might be a need for

  Transaction::register_batch_listener(provider_class)

In addition, there are several potential advantages to having providers
store batches in the transaction object instead of in class variables.
If that's desired, then it might be useful to also have something like

  Transaction::set_attribute(key, value)
  Transaction::get_attribute(key)

(or else something more specific to batches).

> I think the base Provider class also can do something to help track
> the current batch for the implementations. This could be in the form
> of individual methods to track the batch, but I think better is a
> generic system for the provider type to track state (I dislike mutable
> state tracking, but I am not seeing a way around this at the moment).

That could be the transaction attributes proposed above.
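As a rough, self-contained Ruby sketch of how small that interface could
be (every name here, including `YumProvider` and its methods, is
hypothetical and for illustration only):

```ruby
# Sketch of the minimal interface proposed above: Provider::flush_batches
# plus generic transaction attributes. Every name here is hypothetical.

class Transaction
  def initialize
    @attributes      = {}
    @batch_listeners = []
  end

  # Generic per-run state store: providers keep no class variables, so
  # no stale batch data can survive into the next run in daemon mode.
  def set_attribute(key, value)
    @attributes[key] = value
  end

  def get_attribute(key)
    @attributes[key]
  end

  def register_batch_listener(provider_class)
    @batch_listeners << provider_class unless @batch_listeners.include?(provider_class)
  end

  def flush_batches
    @batch_listeners.each { |p| p.flush_batches(self) }
    @batch_listeners.clear
  end
end

# A provider that stores its in-flight batch in the transaction.
module YumProvider
  BATCH_KEY = :yum_batch

  def self.batch(resource_name, transaction)
    batch = transaction.get_attribute(BATCH_KEY) || []
    transaction.set_attribute(BATCH_KEY, batch << resource_name)
    transaction.register_batch_listener(self)
  end

  def self.flush_batches(transaction)
    batch = transaction.get_attribute(BATCH_KEY) || []
    # A real provider would run something like `yum -y install ...` here.
    transaction.set_attribute(:last_applied, batch)
    transaction.set_attribute(BATCH_KEY, [])
  end
end

tx = Transaction.new
YumProvider.batch('vim', tx)
YumProvider.batch('git', tx)
tx.flush_batches
# tx.get_attribute(:last_applied) is now ["vim", "git"]
```

Because the batch lives in the transaction, a fresh transaction starts
with no batch state at all, which is the stale-data point made below.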
One of the great advantages of tracking state in the transaction instead
of in Provider class variables is that there is no risk of stale data
being carried over from one transaction to another when the agent runs
in daemon mode.

> I'm a little worried about the implications for noop, purging and
> deleting, and non-ensure batching (do we make it so that batching is
> just part of the ensure branch? I think so). Right now
> Puppet::Transaction::ResourceHarness has a *lot* of logic around what
> to do in various situations.

I don't immediately see any implications for noop mode. I don't see any
fundamentally new issues related to purging or deleting resources, but
where those are the actions the agent must perform, it has implications
for how batches can be formed. I think it is necessary to let providers
sort that out. Of course, there are no doubt issues that I don't see.

I don't think batching needs to extend to non-ensure properties
directly, or perhaps at all, but it might be nice if those providers
that perform flushing could have the option at flush time to enroll
resources in a batch instead of immediately updating the target
resource.


John

--
You received this message because you are subscribed to the Google
Groups "Puppet Developers" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to puppet-dev+unsubscr...@googlegroups.com.
To post to this group, send email to puppet-dev@googlegroups.com.
Visit this group at http://groups.google.com/group/puppet-dev.
For more options, visit https://groups.google.com/groups/opt_out.