Re: New ExecDB

Kamil Paral Wed, 12 Oct 2016 04:05:55 -0700

> On Tue, Oct 11, 2016 at 1:14 PM, Kamil Paral < kpa...@redhat.com > wrote:


> > > Proposal looks good to me, I don't have any strong objections.
> > 
> 

> > > 1. If you don't like blame: UNIVERSE, why not use blame: TESTBENCH?
> > 
> 
> > > 2. I think that having enum values in details in crash structure would be
> > > better, but I don't have strong opinion either way.
> > 
> 

> > For consistency checking, yes. But it's somewhat inflexible. If the need
> > arises, I imagine the detail string can be in json format (or
> > semicolon-separated keyvals or something) and we can store several useful
> > properties in there, not just one.
> 

> I'd rather do the key-value thing as we do in ResultsDB than storing plalin
> Json. Yes the new Postgres can do it (and can also search it to some
> extent), but it is not all-mighty, and has its own problems.

Sure, that would be even better (but you didn't sound like supporting that idea 
a few days back). So having "state", "blame" and "keyvals" would work well, in 
my opinion. 

> > E.g. not only that Koji call failed, but what was its HTTP error code. Or
> > not
> > that dnf install failed, but also whether it was the infamous "no more
> > mirror to try" error or a dependency error. I don't want to misuse that to
> > store loads of data, but this could be useful to track specific issues we
> > have hard times to track currently (e.g. our still existing depcheck issue,
> > that happens only rarely and it's difficult for us to get a list of tasks
> > affected by it). With this, we could add a flag "this is related to problem
> > XYZ that we're trying to solve".
> 

> I probably understand, what you want, but I'd rather have a specified set of
> values, which will/can be acted upon. Maybe changing the structure to
> `{state, blame, cause, details}`, where the `cause` is still an enum of
> known values but details is freeform, but strictly used for humans?

Hm, I can't actually imagine any case where freeform text *strictly* for humans 
would get too useful. But I can imagine cases where the field is useful for 
humans *and* machines. As an example, we currently have several known causes 
why minions are not properly created. Linking to backingstores might fail, the 
libvirt VM creation or startup might fail, and we might time out while waiting 
for the machine. Those are 3 different reasons for the same overall problem in 
"MINION_CREATION". So, I imagine we can store it like this: 

state: CRASHED 
blame: TASKOTRON 
keyvals: 
step: MINION_CREATION 
reason: TIMED_OUT 

or like this: 

state: CRASHED 
blame: TASKOTRON 
cause: MINION_CREATION 
keyvals: 
reason: TIMED_OUT 

All of that is human readable and machine parseable. The top level keys are 
enums in the database, the keyvals are defined in libtaskotron and the database 
doesn't care. Of course, keyvals are keyvals, so we can easily add something 
like "human_description: foo" if we need it and use it for some use cases, or 
even "time_elapsed: 70" or whatever. It's keyvals, so we can easily ignore the 
ones we don't care about, and use them only in specific cases, or whatever. 
They are flexible and can be easily extended. 

And now we finally get to the point. When Tim asks "do you know how often 
backing stores linking fails during minion creation?" (as he really did in 
T833), we can say "here's a simple query to find out!". Or, if it is not 
implemented yet, we can say "let's add new 'reason: BACKINGSTORE_LINKING' and 
let's see in a month!". Because currently we can't really say anything and the 
best thing to do is to grep all logs, or make some custom hack on our servers. 

> So we can "CRASHED->THIRDPARTY->UNKNOWN->"text of the exception" for example,

I understand it's just an example, but in this case it would be probably 
CRASHED->UNKNOWN->UNKNOWN, because if you don't know to cause, you can't blame 
anybody. 

> or "CRASHED->TASKOTRON->NETWORK->"dnf - no more mirrors to try".

And here the blame would be THIRDPARTY, since it's NETWORK :-) 

I'd prefer keyvals over this, because you can't fit anything useful for 
querying to the free-form details field (or, as you said, you'd rather avoid 
that). The fact that I know it crashed because of network might not be enough 
information to debug issues, decide rescheduling, etc. 

> I'd rather act on a known set of values, then have code like:

You probably meant 'than', right? I'm a bit confused here. 

> if ('dnf' in detail and 'no more mirrors' in detail) or ('DNF' in detail and
> 'could not connect' in detail)....

Do the example with keyvals I posted above look better? I'd say using the 
keyvals with known enum values (defined in libtaskotron) is definitely more 
realiable and maintainable than matching free-form text like above. 

> in the end, it is almost the same, because there will be problems with
> clasifying the errors, and the more layers we add, the harder it gets - that
> is the reason I initially only wanted to do the {state, blame} thing. But I
> feel that this is not enough (just state and blame) information for us to
> act upon - e.g. to decide when to automatically reschedule, and when not,
> but I'm afraid that with the exploded complexity of the 'crashed states' the
> code for handling the "should we reschedule" decisions will be awfull.

Right. But the logic whether to reschedule has to be somewhere. If it lives 
somewhere around trigger, we need to expose enough information to help it 
decide. Because "cause: MINION_CREATION" or "cause: NETWORK" might not mean we 
want to reschedule _always_ (I don't have any concrete use cases now). So 
exposing more details in "step" and "reason" (or whatever) keyvals might be 
used to help with that decision (and of course, they serve also other purposes 
on top of that, like debugging). 

The second approach is to put the logic into libtaskotron, and if it decides 
the task should be rescheduled, it can set "reschedule: TRUE/FALSE" value (as a 
database column, or a keyval). That looks very nice from database POV, but I'm 
not sure that libtaskotron is a good place to put this logic in. 

> Notyfiing the right party is fine (that is what blame gives us), but this is
> IMO what we should focus on a bit.

> Tim, do you have any comments?

> _______________________________________________
> qa-devel mailing list -- qa-devel@lists.fedoraproject.org
> To unsubscribe send an email to qa-devel-le...@lists.fedoraproject.org

_______________________________________________
qa-devel mailing list -- qa-devel@lists.fedoraproject.org
To unsubscribe send an email to qa-devel-le...@lists.fedoraproject.org

Re: New ExecDB

Reply via email to