On Tue, Oct 11, 2016 at 1:14 PM, Kamil Paral <kpa...@redhat.com> wrote:

>> Proposal looks good to me, I don't have any strong objections.
>> 1. If you don't like blame: UNIVERSE, why not use blame: TESTBENCH?
>> 2. I think that having enum values in details in the crash structure would
>> be better, but I don't have a strong opinion either way.
>
> For consistency checking, yes. But it's somewhat inflexible. If the need
> arises, I imagine the detail string can be in JSON format (or
> semicolon-separated keyvals or something) and we can store several useful
> properties in there, not just one.

I'd rather do the key-value thing as we do in ResultsDB than store plain
JSON. Yes, the new Postgres can do it (and can also search it to some
extent), but it is not almighty, and it has its own problems.
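A minimal sketch of what "the key-value thing" could look like for crash details, assuming the details are stored as flat (key, value) string pairs, one pair per value, roughly in the spirit of ResultsDB's extra data rather than as one JSON blob. The helper `to_keyvals` and the example keys are hypothetical:

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, int, List[Union[str, int]]]


def to_keyvals(details: Dict[str, Value]) -> List[Tuple[str, str]]:
    """Flatten a details dict into (key, value) string pairs.

    List values become several pairs sharing the same key, so every
    stored value stays individually searchable (hypothetical scheme).
    """
    pairs: List[Tuple[str, str]] = []
    for key in sorted(details):
        value = details[key]
        values = value if isinstance(value, list) else [value]
        for item in values:
            pairs.append((key, str(item)))
    return pairs


print(to_keyvals({"koji_http_code": 503, "related_issue": ["XYZ", "depcheck"]}))
# [('koji_http_code', '503'), ('related_issue', 'XYZ'), ('related_issue', 'depcheck')]
```

A query such as "all crashes where koji_http_code is 503" then becomes a plain equality match on two columns instead of a lookup inside a JSON document.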

> E.g. not only that the Koji call failed, but what its HTTP error code was.
> Or not only that dnf install failed, but also whether it was the infamous
> "no more mirrors to try" error or a dependency error. I don't want to misuse
> that to store loads of data, but this could be useful to track specific
> issues we currently have a hard time tracking (e.g. our still-existing
> depcheck issue, which happens only rarely and for which it's difficult to
> get a list of affected tasks). With this, we could add a flag "this is
> related to problem XYZ that we're trying to solve".

I think I understand what you want, but I'd rather have a specified set
of values which can be acted upon. Maybe we could change the structure to
`{state, blame, cause, details}`, where `cause` is still an enum of known
values, but `details` is free-form and strictly meant for humans? Then we
could have `CRASHED->THIRDPARTY->UNKNOWN->"text of the exception"`, for
example, or `CRASHED->TASKOTRON->NETWORK->"dnf - no more mirrors to try"`.
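A minimal sketch of the proposed `{state, blame, cause, details}` structure, assuming plain `enum.Enum` types. The class names are hypothetical, and the enum members are only the values mentioned in this thread, not a complete list:

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    CRASHED = "CRASHED"  # other states omitted in this sketch


class Blame(Enum):
    TASKOTRON = "TASKOTRON"
    THIRDPARTY = "THIRDPARTY"


class Cause(Enum):
    NETWORK = "NETWORK"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class Crash:
    state: State   # machine-readable
    blame: Blame   # who should be notified
    cause: Cause   # machine-readable, what the code acts upon
    details: str   # free-form text, for humans only

    def __str__(self) -> str:
        # Render in the STATE->BLAME->CAUSE->"details" form used above
        return f'{self.state.value}->{self.blame.value}->{self.cause.value}->"{self.details}"'


crash = Crash(State.CRASHED, Blame.TASKOTRON, Cause.NETWORK, "dnf - no more mirrors to try")
print(crash)  # CRASHED->TASKOTRON->NETWORK->"dnf - no more mirrors to try"
```

Code would only ever branch on `state`, `blame` and `cause`; `details` is carried along for whoever reads the log.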

I'd rather act on a known set of values than have code like:

    if ('dnf' in detail and 'no more mirrors' in detail) or \
            ('DNF' in detail and 'could not connect' in detail):
        ...
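One way to keep that kind of string matching from spreading is to do it exactly once, where the error is caught, and hand the rest of the code only an enum value. A sketch, assuming a hypothetical `classify_dnf_error()` helper that is only called for dnf errors; the matched substrings are taken from the example above and are illustrative, not a complete list of dnf messages:

```python
from enum import Enum


class Cause(Enum):
    NETWORK = "NETWORK"
    UNKNOWN = "UNKNOWN"


# Lowercase substrings; lowercasing the message covers 'dnf' vs 'DNF'.
_NETWORK_PATTERNS = ("no more mirrors", "could not connect")


def classify_dnf_error(message: str) -> Cause:
    """Map a raw dnf error message to a known Cause (hypothetical helper)."""
    text = message.lower()
    if any(pattern in text for pattern in _NETWORK_PATTERNS):
        return Cause.NETWORK
    # Anything not recognized stays UNKNOWN instead of guessing.
    return Cause.UNKNOWN


print(classify_dnf_error("dnf - no more mirrors to try"))  # Cause.NETWORK
```

The classification problem does not go away, but it lives in one function, and anything it fails to recognize surfaces as UNKNOWN rather than taking a wrong branch somewhere else.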

In the end, it is almost the same, because there will be problems with
classifying the errors, and the more layers we add, the harder it gets;
that is the reason I initially only wanted to do the {state, blame} thing.
But I feel that state and blame alone are not enough information for us
to act upon, e.g. to decide when to reschedule automatically and when
not. At the same time, I'm afraid that with the exploding complexity of
the 'crashed states', the code handling the "should we reschedule"
decisions will be awful. Notifying the right party is fine (that is what
blame gives us), but the rescheduling decision is IMO what we should
focus on a bit.
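To keep the "should we reschedule" code from growing a new branch for every combination, one option is to express it as an explicit lookup table. A minimal sketch; `RESCHEDULE_POLICY`, `should_reschedule()`, the `attempts` counter and the retry limits are all hypothetical, and the enums contain only values mentioned in this thread:

```python
from enum import Enum
from typing import Dict, Tuple


class Blame(Enum):
    TASKOTRON = "TASKOTRON"
    THIRDPARTY = "THIRDPARTY"


class Cause(Enum):
    NETWORK = "NETWORK"
    UNKNOWN = "UNKNOWN"


# Only the listed (blame, cause) pairs are ever retried; the value is the
# maximum number of attempts. Pairs and limits are illustrative assumptions.
RESCHEDULE_POLICY: Dict[Tuple[Blame, Cause], int] = {
    (Blame.TASKOTRON, Cause.NETWORK): 3,
    (Blame.THIRDPARTY, Cause.NETWORK): 3,
}


def should_reschedule(blame: Blame, cause: Cause, attempts: int) -> bool:
    """Return True if a crashed task should be rescheduled automatically.

    Unlisted combinations (including anything with Cause.UNKNOWN) default
    to zero retries, i.e. "notify the blamed party, don't reschedule".
    """
    max_attempts = RESCHEDULE_POLICY.get((blame, cause), 0)
    return attempts < max_attempts


print(should_reschedule(Blame.TASKOTRON, Cause.NETWORK, attempts=1))   # True
print(should_reschedule(Blame.THIRDPARTY, Cause.UNKNOWN, attempts=0))  # False
```

Adding a new cause then means adding a table row (or nothing at all, if it should never be retried) instead of another nested condition.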

Tim, do you have any comments?
qa-devel mailing list -- qa-devel@lists.fedoraproject.org