On Wednesday, September 18, 2013 9:07:12 AM UTC-5, Andy Parker wrote:
>
> On Tue, Sep 17, 2013 at 1:19 PM, John Bollinger <john.bo...@stjude.org> wrote:
>>
>> On Monday, September 16, 2013 11:12:58 AM UTC-5, Andy Parker wrote:
>>>
>>> On Mon, Sep 16, 2013 at 6:56 AM, John Bollinger <john.bo...@stjude.org> wrote:
>>>>
>>>> On Monday, September 16, 2013 6:48:17 AM UTC-5, Andy Parker wrote:
>>>>>
>>>>> The problem with this picture for being able to batch operations
>>>>> together is that everything turns into calls on the Puppet::Type
>>>>> instance, which then makes all of the individual calls to the
>>>>> provider. To batch, we need to group resources together and then
>>>>> give the groups to the provider.
>>>>
>>>> So you don't think my suggestion above to let providers assemble and
>>>> apply batches is workable? I think it requires only one or two extra
>>>> signals to the provider (via the Type) to mark batch boundaries. Most
>>>> of the magic would happen in those providers that choose to perform
>>>> it, and those that don't can just ignore the new signals. The main
>>>> part of the protocol between the agent core and types/providers
>>>> remains unchanged.
>>>>
>>>> I haven't delved into the details of how exactly it would be
>>>> implemented, so perhaps there is a show-stopper there, but I'm not
>>>> seeing a flaw in the basic idea.
>>>
>>> No, you are right. I forgot about that one. I was just running through
>>> the code; the biggest problem that I can see so far is simply that
>>> there isn't "the provider". We end up with a provider instance per
>>> resource, as far as I can tell. Others have solved that by tracking
>>> data on the provider class.
>>>
>>> I think that for the batching we just need a way of asking a provider
>>> if two resources (for the same provider class) are batchable.
>>> The comparison of batchable needs to be transitive (so if A and B are
>>> batchable, and so are B and C, then all of A, B, and C are). In fact
>>> it needs to also be symmetric and reflexive, since it is really just
>>> another form of equality. That helps us to define what can be batched
>>> together.
>>
>> I think an equivalence relation may be stronger than is needed. It
>> should be sufficient to be able to answer this weaker question: given
>> a set S of mutually batchable resources and a resource R not in S, can
>> R be batched together with all the resources of S? It is possible for
>> a provider type to be able to batch resources on that basis, but not
>> meaningfully to batch resources based on a full equivalence relation.
>
> The difference just comes down to how conservative the approach should
> be. The equivalence relation leaves out the possibility of a provider
> being able to say that, of a set of three resources A, B, C, it can
> batch [A, B] or [B, C], but not [A, B, C] (because of the transitive
> constraint). The way that you state it would allow it to do that.
Yes.

> I actually think that leaving the batching that open would lead to
> unwanted variations of batches between runs.

How so? Puppet is deterministic, is it not? Of course, if manifests
change then the batching may change, but that's a possibility no matter
how batching is implemented. As a practical matter, I think the
variations arising from a more open formulation are far more likely to
involve packages that are batched incidentally -- e.g. packages declared
by different classes, maybe -- than to involve packages where batching
provides a desired functional benefit. I think the implementation could
ensure that, if desired.

> By expressing the batches in the form of a comparison operator we can
> guarantee consistency (unless the implementation of the comparison does
> not conform to the requirements).

I don't think you get any more consistency this way. You may get
different batches than you do with the other approach, but both are
sensitive to the same variations in the sequence in which resources are
proposed for batching.

> There is actually also a run-time problem. The comparison operator
> allows batches to be built in O(n) time, where n is the number of
> resources to consider, whereas, in general, the set-inclusion method
> would be O(n^2).

An implementation that must compare each offered resource to all the
others already in the set would indeed consume O(n^2) time, but such
implementations are by no means the only possibilities. In particular,
implementations based on equivalence classes are an alternative that
providers could choose to employ, and we agree that such an
implementation would require only O(n) time. Moreover, there are
countless potential implementations specific to particular resource
types or provider types that have suitable characteristics.
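To make the O(n) claim concrete, here is a quick Ruby sketch of the
equivalence-class approach. Nothing in it is existing Puppet API: the
`batch_key` method and the ensure-based keying (separating removals from
installs, as a yum-like provider might) are illustrative assumptions.

```ruby
# Sketch: O(n) batch formation via equivalence classes. The batch_key
# method is a hypothetical provider API: it must return equal values
# for exactly those resources that are mutually batchable.

Resource = Struct.new(:name, :ensure_value)

# Illustrative key for a yum-like provider: removals and installs go
# into separate classes, so the two are never mixed in one batch.
def batch_key(resource)
  resource.ensure_value == :absent ? :removals : :installs
end

# One pass over n resources; Hash-backed group_by keeps this O(n).
def build_batches(resources)
  resources.group_by { |r| batch_key(r) }.values
end

resources = [
  Resource.new('vim',   :installed),
  Resource.new('emacs', :absent),
  Resource.new('git',   :latest)
]

batches = build_batches(resources)
batches.map { |b| b.map(&:name) }  # => [["vim", "git"], ["emacs"]]
```

The more flexible set-inclusion predicate could be layered on top of the
same loop; only the grouping test would change.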
> I think having the batching criteria a bit more conservative ends up
> being a win, because of ease of definition and speed of execution, even
> if it will miss some cases where it could have created a batch.

I think defining the interface in such a way that it allows more
flexibility ends up being a win, because it allows providers more
freedom to customize behavior as appropriate for their underlying
implementation. Inasmuch as the same equivalence-class approach you
propose can be implemented under the more flexible scheme, there can be
no *inherent* speed penalty in allowing greater flexibility. The
interface definition and sequence of operations can be at least as
simple in the more flexible case, as I will show, and it affords the
opportunity for bigger batches, which is also a performance advantage.
Indeed, inasmuch as running external commands is very expensive, I think
you have to get to very large batch sizes indeed before even an O(n^2)
batching algorithm loses, despite reducing the number of batches.

> I agree: the provider type specifies what can be in a batch (via the
> comparison discussed above), and the core decides what candidates to
> try to batch.
>
>> Consider, for example, the "yum" Package provider. Because of yum's
>> nature, the provider cannot easily support batching out-of-sync
>> packages ensured 'absent' with out-of-sync packages ensured
>> 'installed', 'latest', or <version>, but as long as external
>> considerations (e.g. relationships with other resources) do not
>> preclude it, the yum provider could simultaneously build separate
>> batches for the two categories. That would allow for larger batches
>> to be formed under some circumstances, and it could be essential for
>> correct operation of removals in others.
>> More generally, batching under control of providers would allow
>> batches of different provider types and even of different resource
>> types to be formed simultaneously, provided always that the
>> application order of the relevant resources is not constrained.
>
> Are you thinking that a provider should somehow batch together
> resources of different type? That seems like a very accident-prone
> thing to do.

No, I am thinking that multiple distinct providers could have batch
assemblies in-flight simultaneously. Also, some individual providers
might have multiple in-flight at the same time. It might work like this:

1. Initially there are zero batches.
2. The transaction chooses a "next" resource by any mechanism of its
   choice.
3. The transaction determines whether any of the extrinsic criteria it
   can evaluate itself prevent the resource from being included in a
   batch with previous ones. Some formulations of this test could be
   very cheap.
4. If extrinsic considerations prevent batching, then the transaction
   sends a "flush" message that causes all in-flight batches to be
   applied by the provider classes maintaining them.
5. The transaction dispatches the resource for application, using the
   existing interface that already serves that purpose.
6. The provider for the resource chooses whether to apply it immediately
   or whether to add it to an existing or new batch. It may choose at
   this time to apply one or more entire batches that it had had
   in-flight.
7. If there are more resources, return to step 2.
8. The transaction sends a "flush" message, as in step 4, to ensure that
   the last batches are applied.

Note 1: Where I say "transaction" above, it might be more appropriate to
say "resource graph" or to name some other component. I'm still not
focusing on implementation details, though you should feel free to raise
potential implementation issues if you see any.
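The eight steps above can be sketched as a dispatch loop. To be clear,
every class and method name below (`Transaction`, `apply`,
`register_batch_listener`, `flush_batches`, `extrinsically_batchable?`)
is a hypothetical name for illustration, not existing Puppet internals:

```ruby
# Sketch of the eight-step dispatch protocol described above. All class
# and method names here are hypothetical, not existing Puppet internals.

Resource = Struct.new(:name, :provider_class)

class Transaction
  def initialize(resources)
    @resources = resources
    @batch_listeners = []            # providers with in-flight batches
  end

  # Steps 4 and 8: tell every subscribed provider to apply its batches.
  def flush_batches
    @batch_listeners.each(&:flush_batches)
    @batch_listeners.clear
  end

  def register_batch_listener(provider_class)
    @batch_listeners << provider_class unless @batch_listeners.include?(provider_class)
  end

  # Step 3: extrinsic criteria the transaction can evaluate itself,
  # e.g. resource relationships. Trivially permissive in this sketch.
  def extrinsically_batchable?(_resource)
    true
  end

  def run
    @resources.each do |resource|                    # step 2: pick next
      flush_batches unless extrinsically_batchable?(resource)  # step 4
      resource.provider_class.apply(resource, self)  # step 5: dispatch
    end
    flush_batches                                    # step 8: final flush
  end
end

# Step 6: a provider that batches instead of applying immediately.
class BatchingProvider
  @batch   = []
  @applied = []                      # record of applied batches
  class << self
    attr_reader :applied

    def apply(resource, transaction)
      @batch << resource
      transaction.register_batch_listener(self)
    end

    def flush_batches
      @applied << @batch.map(&:name) unless @batch.empty?
      @batch = []
    end
  end
end

Transaction.new([Resource.new('a', BatchingProvider),
                 Resource.new('b', BatchingProvider)]).run
# BatchingProvider.applied is now [["a", "b"]]: one batch of two resources
```

The existing `apply` dispatch (step 5) is the only point of contact, so a
provider that ignores batching entirely just applies each resource
inside its own `apply` and never subscribes to flushes.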
Note 2: By "extrinsic" criteria I mean criteria that can be evaluated
without consulting providers. Prime among these are resource
relationships. One way this could be cheap is for the transaction to
work on resources in blocks: initially, all those that are ready to be
applied immediately relative to explicit and automatic relationships.
All resources in such a block are considered mutually batchable in the
extrinsic sense, but not batchable with anything else. After each block
is dispatched, the next block of all resources then ready to be applied
immediately is determined.

Note 3: The ability to target "flush" messages appropriately at steps
(4) and (8) may require a mechanism whereby providers can subscribe to
such messages when they batch a resource instead of applying it
immediately.

Note 4: If you don't like the idea of multiple providers assembling
batches simultaneously, then you can include a restriction against that
among the extrinsic considerations that may prevent batching. In that
case, it might be advantageous to use an implementation of step (2) that
attempts to group resources by provider.

> One thing that I'm still uncertain about is how to handle failures,
> and what should appear in the report. Should all resources get the
> same event? Should they all fail or succeed together? Is that
> something that the provider gets to decide? I think the provider
> should be able to decide it, so that if it is able to separate the
> parts of the batch, then it can report on that, and if it isn't, then
> it just gives them all the same status.

Very good questions. Regarding eventing: the change is least intrusive
if each resource continues to publish its own events, but it's not clear
to me whether there is any practical, observable difference. Regarding
failing or succeeding together: are you talking only about reporting and
logging, or also about strategies for handling failures?
For example, a provider might handle a failed batch by attempting to
apply the member resources individually. I do think that when resources
are applied in a batch, it makes sense for the log and report to reflect
that. If a batch fails but a fallback strategy partially or wholly
succeeds, then at least the log should reflect all of the above.
Overall, however, I'm all in favor of allowing providers as much freedom
as feasible.

> Should the report contain information about what the batch was? What
> else might the report contain?

I think non-trivial batches should be reported. I don't immediately know
what use the information might be put to, but I think it reasonable to
expect that all the creative people in the community will find uses.

> Based on this, I think that at a minimum, the new interface for
> providers is:
>
>   * Provider::batchable?(resource1, resource2)
>   * Provider::batch_start
>   * Provider::batch_end

I think the minimum new interface for providers is just

  Provider::flush_batches (class method)

If multiple providers are allowed to assemble batches simultaneously,
then there might be a need for

  Transaction::register_batch_listener(provider_class)

In addition, there are several potential advantages to having providers
store batches in the transaction object instead of in class variables.
If that's desired, then it might be useful to also have something like

  Transaction::set_attribute(key, value)
  Transaction::get_attribute(key)

(or else something more specific to batches).

> I think the base Provider class also can do something to help track
> the current batch for the implementations. This could be in the form
> of individual methods to track the batch, but I think better is a
> generic system for the provider type to track state (I dislike mutable
> state tracking, but I am not seeing a way around this at the moment).

That could be the transaction attributes proposed above.
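As a rough, self-contained Ruby sketch of how small that interface could
be (every name here, including `YumProvider` and its methods, is
hypothetical and for illustration only):

```ruby
# Sketch of the minimal interface proposed above: Provider::flush_batches
# plus generic transaction attributes. Every name here is hypothetical.

class Transaction
  def initialize
    @attributes      = {}
    @batch_listeners = []
  end

  # Generic per-run state store: providers keep no class variables, so
  # no stale batch data can survive into the next run in daemon mode.
  def set_attribute(key, value)
    @attributes[key] = value
  end

  def get_attribute(key)
    @attributes[key]
  end

  def register_batch_listener(provider_class)
    @batch_listeners << provider_class unless @batch_listeners.include?(provider_class)
  end

  def flush_batches
    @batch_listeners.each { |p| p.flush_batches(self) }
    @batch_listeners.clear
  end
end

# A provider that stores its in-flight batch in the transaction.
module YumProvider
  BATCH_KEY = :yum_batch

  def self.batch(resource_name, transaction)
    batch = transaction.get_attribute(BATCH_KEY) || []
    transaction.set_attribute(BATCH_KEY, batch << resource_name)
    transaction.register_batch_listener(self)
  end

  def self.flush_batches(transaction)
    batch = transaction.get_attribute(BATCH_KEY) || []
    # A real provider would run something like `yum -y install ...` here.
    transaction.set_attribute(:last_applied, batch)
    transaction.set_attribute(BATCH_KEY, [])
  end
end

tx = Transaction.new
YumProvider.batch('vim', tx)
YumProvider.batch('git', tx)
tx.flush_batches
# tx.get_attribute(:last_applied) is now ["vim", "git"]
```

Because the batch lives in the transaction, a fresh transaction starts
with no batch state at all, which is the stale-data point made below.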
One of the great advantages of tracking state in the transaction instead
of in Provider class variables is that there is no risk of stale data
being carried over from one transaction to another when the agent runs
in daemon mode.

> I'm a little worried about the implications for noop, purging and
> deleting, and non-ensure batching (do we make it so that batching is
> just part of the ensure branch? I think so). Right now
> Puppet::Transaction::ResourceHarness has a *lot* of logic around what
> to do in various situations.

I don't immediately see any implications for noop mode. I don't see any
fundamentally new issues related to purging or deleting resources, but
where those are the actions the agent must perform, it has implications
for how batches can be formed. I think it is necessary to let providers
sort that out. Of course, there are no doubt issues that I don't see.

I don't think batching needs to extend to non-ensure properties
directly, or perhaps at all, but it might be nice if those providers
that perform flushing could have the option at flush time to enroll
resources in a batch instead of immediately updating the target
resource.


John

--
You received this message because you are subscribed to the Google
Groups "Puppet Developers" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to puppet-dev+unsubscr...@googlegroups.com.
To post to this group, send email to puppet-dev@googlegroups.com.
Visit this group at http://groups.google.com/group/puppet-dev.
For more options, visit https://groups.google.com/groups/opt_out.