This should not be seen as an openEHR-specific problem; I guess many more
modelling systems (not only two-level ones) can run into the same
situation, but I guess developers have caught most of these cases. But
since this is an openEHR list, and I ran into this problem in an openEHR
kernel, I discuss it in the context of openEHR.
On 14-05-14 03:04, pablo pazos wrote:
> Hi Bert,
>
> why the validator should need to continue traversing the instance?
>
>
> Hi Pablo, because the attributes often contain complex openEHR data
> types, so the validator needs to check these complex data types in
> the attributes too, and those data types can again contain complex
> data types. In the case of this example, Dv_Text matches {*}, you'll
> need to check everything, every structure, until you reach the leaf
> nodes, which in this example can be anything. Only then can you be
> sure that the data set is openEHR compliant.
>
> That was my point :) The validation that needs to reach leaf nodes is
> not the archetype validation, but the IM structure validation. That
> has nothing to do with the open constraint {*} in the archetype. In
> fact, that validation can be done completely without considering the
> archetype. What I said about using the XSD is just one way of
> implementing it; you can also do that in code.
With this I do not agree: an archetype constrains the structure of a
dataset, and it constrains the contents of a leaf node. The archetype
constrains, IMHO, which class attributes are to be used, and what/where
they will lead to.
Of course the archetype is modelled inside the boundaries of the
reference model (possibly expressed as an XSD), so that is always
something that has to fit.
If there are no constraints in an archetype (wildcarded), then
everything which is possible in the reference model is legal in a
dataset.
>
> The thing is that a DvText can have the attribute: mappings, and
> there you can find the attribute: purpose, of type DvCodedText, which
> again can have an attribute: mappings, which can again have an
> attribute: purpose, etc.
>
> I got it ;)
>
> So the occurrence of the leaf node can be far away and still be
> compliant with the statement: DvText matches {*}, and a 100% compliant
> validator will need to follow all these steps. Of course this is not a
> normal situation, but it can happen. As said, we cannot always control
> incoming data sets. There may be buggy software in the ecosystem where
> a kernel runs.
>
> That really depends on the implementation. Let's say the system doesn't
> control the input, so you can receive anything, for example binary
> data where you expect a dv_quantity. In that case, what I proposed
> implicitly is to have a 2 phase validator, 1st syntactic (against the
> IM, yes we need to reach leaf nodes here!), 2nd semantic (IMO we can
> prune the validator if we reach stuff like {*}). If the 1st phase
> returns invalid, there's no need to execute the 2nd. If you execute
> the second, you'll never reach an infinite recursion because of pruning.
>
> Sorry, maybe I can't explain myself clearly; it is difficult to show
> this over email. Maybe others can validate or deny this.
There is a lot of validation done by the libraries you probably use. For
example, if you import XML, Xerces or other libraries check the XML for
errors. If it is a character stream, libraries check for illegal binary
codes; many, many checks are done inside libraries.
And before and after your libraries, the operating system does a lot of
checking too, for example that you do nothing illegal with memory, and
there are virus scanners sitting on your network stack.
I guess half of the code on a computer, at all levels, does nothing but
check for errors.
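For what it's worth, the IM/XSD phase Pablo mentions is exactly this
kind of library-level check. A minimal sketch using the standard Java
XML validation API (the file names are made up; any published RM schema
would do):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class XsdCheck {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("Composition.xsd"));
        Validator validator = schema.newValidator();
        // Throws a SAXException if the document violates the schema.
        validator.validate(new StreamSource(new File("incoming.xml")));
        // Getting here means the data is RM-compliant at the XML level,
        // which is exactly the situation described below.
    }
}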
The problem with this situation is that nothing illegal happens: there
is no error in the dataset. The problem can occur inside a dataset
which is fully compliant with the Reference Model.
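To make it concrete, here is a rough Java sketch of the RM fragment
involved, together with a naive validator that faithfully follows what
an open {*} constraint allows. The classes are simplified stand-ins for
the real RM classes; the point is only that the data built below is
perfectly legal, yet the recursion depth is bounded only by the data
itself:

import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the RM classes, not the real ones.
class DvText {
    String value;
    List<TermMapping> mappings = new ArrayList<>();   // 0..* in the RM
}
class DvCodedText extends DvText { }                   // is-a DV_TEXT
class TermMapping {
    DvCodedText purpose;                               // optional in the RM
}

class NaiveValidator {
    // Follows every attribute, as "DvText matches {*}" allows.
    void validate(DvText t) {
        for (TermMapping m : t.mappings) {
            if (m.purpose != null) {
                validate(m.purpose);    // purpose is itself a DV_TEXT
            }
        }
    }

    public static void main(String[] args) {
        // A fully RM-compliant instance, nested as deep as a buggy
        // producer cares to make it.
        DvText root = new DvText();
        DvText current = root;
        for (int i = 0; i < 100_000; i++) {
            TermMapping m = new TermMapping();
            m.purpose = new DvCodedText();
            current.mappings.add(m);
            current = m.purpose;
        }
        // Nothing here is invalid, yet this blows the stack.
        new NaiveValidator().validate(root);   // StackOverflowError
    }
}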
>
> To be safe and with feasibility in mind, a validator would need to
> stop validating at some arbitrary point, although there is no error.
> So a validator which follows the rules 100% is dangerous! It can
> crash a system.
>
> Having two phase validators,
I was thinking about something like that, but I could not imagine a
pattern which would handle this. Maybe you can give us some pointers.
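The closest I could come up with is something like the sketch below:
prune at {*}, as you describe, plus an arbitrary depth cap as a safety
net. All the names (ArchetypeNode, isAnyAllowed, select) are made up
for illustration; this is not an existing API:

import java.util.List;

interface ArchetypeNode {
    boolean isAnyAllowed();           // the node is an open {*} constraint
    List<ArchetypeNode> children();   // constraints below this node
    Object select(Object data);       // pick the matching part of the data
}

class PruningValidator {
    private static final int MAX_DEPTH = 100;   // the arbitrary part

    boolean validate(ArchetypeNode constraint, Object data, int depth) {
        if (depth > MAX_DEPTH) {
            // Safety net against legal but absurdly deep data.
            throw new IllegalStateException("nested deeper than " + MAX_DEPTH);
        }
        if (constraint.isAnyAllowed()) {
            // {*}: anything the RM allows is fine here, and phase 1
            // (the RM/XSD check) has already covered it, so prune.
            return true;
        }
        for (ArchetypeNode child : constraint.children()) {
            if (!validate(child, child.select(data), depth + 1)) {
                return false;
            }
        }
        return true;
    }
}

The pruning part matches your proposal; the MAX_DEPTH part is exactly
the arbitrary stopping point I was complaining about.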
> I don't know if there's any case where not covering 100% might let
> invalid data pass as valid, or where covering 100% ends with a stack
> overflow. Finding a counter case would be enough to invalidate my
> proposal :)
As I said, the dataset which can cause the problem is fully openEHR
compliant; there is no invalid data involved. If there were, it would
be easy to handle.
> That was my point.
>
> You are right in your statement that when a part of an archetype is
> wildcarded, the XSD is the place to find the validation rules.
>
> Maybe the problem is trying to validate against the archetype at first
> and then validate the IM. I think it should be IM 1st and AM 2nd. But
> of course, I may have overlooked some pathological case and this might not
> work on 100% of the cases.
I don't know why you call it pathological, as if it had to do with
humans, with someone deliberately trying to crash systems. That can
surely be the case.
But more likely it is a system which has a bug and creates a faulty
dataset. A "pathological" dataset can simply be the result of a buggy
system.
And you know Murphy: If a problem CAN occur, it WILL occur.
>
> Another thing that might be helpful is not to use archetypes directly,
> use OPTs. I learned that the hard way. OPTs can contain the whole
> structure and constraints of specific compositions. So if someone
> specifies DV_TEXT in the OPT, my interpretation is they don't need a
> DV_CODED_TEXT there. Also, an OPT is all in one file, while with
> archetypes you have to deal with slots (argghhhh). In fact, right now
> I'm changing all my systems adding OPT support. Simpler to validate,
> simpler to query.
This does not conform to the specs. If a DV_TEXT is defined in an
archetype, a DV_CODED_TEXT is legal in the dataset. Inheritance rules
apply to datasets.
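In code terms this is plain subtyping; a minimal sketch (again with
simplified stand-ins, not the real RM classes) of why a constraint on
DV_TEXT must also accept a DV_CODED_TEXT instance:

// Simplified stand-ins, not the real RM classes.
class DvText { String value; }
class DvCodedText extends DvText { String definingCode; }

class ConformanceCheck {
    // An archetype constraint on DV_TEXT is satisfied by any subtype.
    static boolean conformsToDvText(Object node) {
        return node instanceof DvText;
    }

    public static void main(String[] args) {
        System.out.println(conformsToDvText(new DvText()));       // true
        System.out.println(conformsToDvText(new DvCodedText()));  // also true
    }
}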
> (argghhhh)
I see that you dislike slots. It is another discussion, but in my view
slots are the best way to make a data definition flexible and
extensible. But let's not stray from the original discussion theme.
Maybe you can work this point out in a new thread; I would welcome a
discussion about slots.
Best regards
Bert