Polishing node identifier (at-codes) use cases.

Bert Verhees Wed, 25 Sep 2013 01:53:58 +0200

Op 24-9-2013 19:54, Thomas Beale schreef:
> On 24/09/2013 00:10, Bert Verhees wrote:
>>
>>>
>>> For that reason I believe specifications should very carefully 
>>> specify things. I'll give a very simple example. The openEHR 
>>> specifications routinely specify which properties of a class are 
>>> mandatory, optional, and which String fields have to be non-empty. 
>>> Even those simple things help save time.
>>
>> What time do you save? Allowing developers to write sloppy code 
>> because they don't need to check for a null-value?
>> Do you think that professional programmers are not able to apply 
>> basic programming rules, to check for a null value when retrieving 
>> data from a database or external source?
>>
>> I don't know which quality of software-development you expected in 
>> the OpenEHR community when writing this spec, but it does not seem 
>> that you had much confidence in developers, at that time.
>
> it's not developers like you or many of the other careful, thoughtful 
> and professional people on these lists. But there are huge numbers of 
> developers out there whose main job is implementing something else, 
> but who have to quickly 'put something together' for this or that 
> project, typically in a department of health, hospital or other 
> provider site. These people have to write code in a rushed way, and 
> will inevitably solve things as fast as possible without deep 
> contemplation. And yet - those pieces of software routinely end up in 
> real health data processing environments. So the aim of the specs is 
> to reduce errors by this kind of development.
>
> Like I said, particular choices in the specs to achieve that might be 
> wrong, and the community here needs to help improve that.


So we can stop this discussion right here. I respect that you wanted to 
express your opinion on my message, but there is no need for me to 
comment on this. We agree that shit happens all the time, and, apart 
from that, that you will support a change of spec regarding the issue 
of an empty string representing a null value.

But we have the other issue of having one property to store two 
different things without indication what is stored. You call having a 
property which is not often used a waste.

>>
>> A waste of a data-property?
>>
>> I do not understand what you are trying to say.  Do you mean that 
>> there are occasions in which a specific property is useless?
>> Because it is not used? Then I must say that OpenEHR has a lot of 
>> waste, because there are many properties which are not used all the time.
>> :)
>
> sure - if you have a separate property to store the archetype id, it 
> is empty in 95% of all object instances, and also you need a class 
> invariant to prevent it being filled at the same time as the 
> archetype_node_id (at-code) property.

I must disagree, it is very common in archetypes, I think it is in 90% 
of the archetypes that the root of a definition also has a node_id. So 
in that case both can occur simultaneously. But in the path only the 
archetype_id will occur, and it is easier for a programmer to find which 
one is the archetype_id if it is in a separate property.

And anyway, I don't think a seldom used property is a waste. It is only 
bits and bytes, and hardly any code is involved in having this 
property. But as I showed in my example, not having this property can 
make it necessary to execute many thousands of lines of code. That is a 
waste.

We, system-builders, and special system-designers like you, do not 
decide which archetypes are going to be used.
There are archetypes of megabytes; they exist. I don't think it is wise 
to have them, but modeling is not always focused on performance; it is 
often focused more on academic medical ideas.
We, builders of two-level modeling systems, must be able to live with 
this kind of academic exercise.

But those archetypes cost one second or more just to parse on a 
medium-speed computer.
You don't want to do this unnecessarily; you don't want to parse that 
kind of archetype at every data entry. It breaks your system.

Because there is no sure way of analyzing a string to find out whether 
it is an archetype_node_id or an archetype_id in slotted situations, 
other than parsing and analyzing the archetype, having one property for 
two different values is inefficient, and in some situations 
dramatically inefficient.
>> One way is:
>> Having different instances, connected via a not in the specs defined 
>> connection indicating that one instance should be placed inside the 
>> property of another instance.
>> Talking about errors, here is a situation in which the specs fail to 
>> indicate how the connection must be made, and it is left to implementors.
>>
>> Seeing that the spec fail to specify this (and the specs want to 
>> protect us against simple programming-errors), we must conclude that 
>> the specs want us to really implement archetype-slotted instances to 
>> be a materialized part of the containing instance.
>
> If you are referring to what the data instance structure looks like, 
> yes if the reference model says it is inline (i.e. included by value) 
> then that's what it is. The corresponding archetype structure 
> technically could be made of multiple archetypes, connected by slots, 
> or by one large archetype acting as a template.

The idea of what I was saying, which I think I can express more clearly 
now, is that there are two ways of embedding a slotted dataset (based on 
an archetype which fits in the slot) in the containing dataset (based on 
the archetype which has the slot, so to say, the containing archetype).

One way is to add a reference to the container-dataset, which points to 
the slotted dataset.
The other way is to add the slotted dataset materialized in the 
container-dataset.
(The expression "materialized" is from Oracle)

The first one is not described in the specs; that is, there is no spec 
which indicates how to reference the datasets.
In theory the specs expect the second situation. The paths in AQL or 
templates are defined on the assumption that the slotted datasets are 
materialized inside the containing dataset.
This is also the simplest way to do this.

This causes, however, a problem.

Imagine you have a dataset and you want to express a path to a leaf-value.
You must know in that case if there are slotted datasets in it, because 
the path will follow other syntax rules in case of slots.

So in a PERSON without slots a contact would look like this

[person-archetype]/contacts[at0003]/items[at0004].............

In a PERSON with slots it would look like this.
[person-archetype]/contacts[at0003]/[contact_archetypeId]/items[at0004].............

So if you have a large dataset and you want to express ADL-paths to 
leaf-nodes, you need to know if there are slots.
There is one way to find out: parse the corresponding archetype and find 
out if there are slots.
You need to do that because you cannot trust string analysis of the 
archetype_node_id.
So you have to execute thousands of lines of code to find out if an 
archetype contains slots.

If there was a separate property for archetype_id, then it would only be 
a matter of checking whether that property has a null value (or an empty 
string :)
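
The idea can be sketched in Java as follows. This is only an illustration under stated assumptions: the classes Node and SlotPaths and the field archetypeId are hypothetical and are not taken from the openEHR reference model. It shows that, with a separate archetype_id property, detecting a slot is a null (or empty string) check on the data itself, and that the two path shapes from the PERSON examples above can be built without parsing any archetype.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical data node, NOT a class from the openEHR reference model.
final class Node {
    final String attribute;   // attribute name in the parent, e.g. "contacts"
    final String nodeId;      // at-code, e.g. "at0003"
    final String archetypeId; // assumed separate property; null or "" when no slot is filled here

    Node(String attribute, String nodeId, String archetypeId) {
        this.attribute = attribute;
        this.nodeId = nodeId;
        this.archetypeId = archetypeId;
    }

    // The whole "is this a slot?" question becomes a check on one property.
    boolean isSlotRoot() {
        return archetypeId != null && !archetypeId.isEmpty();
    }
}

final class SlotPaths {
    // Builds a path in the shape used in the PERSON examples above.
    static String adlPath(String rootArchetypeId, List<Node> steps) {
        StringBuilder sb = new StringBuilder("[" + rootArchetypeId + "]");
        for (Node n : steps) {
            sb.append('/').append(n.attribute).append('[').append(n.nodeId).append(']');
            if (n.isSlotRoot()) {
                sb.append("/[").append(n.archetypeId).append(']');
            }
        }
        return sb.toString();
    }

    // True if any step crosses into a slotted archetype; no archetype parsing involved.
    static boolean containsSlots(List<Node> steps) {
        for (Node n : steps) {
            if (n.isSlotRoot()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Node> plain = Arrays.asList(
            new Node("contacts", "at0003", null),
            new Node("items", "at0004", null));
        List<Node> slotted = Arrays.asList(
            new Node("contacts", "at0003", "contact_archetypeId"),
            new Node("items", "at0004", null));
        System.out.println(adlPath("person-archetype", plain));
        System.out.println(adlPath("person-archetype", slotted));
    }
}
```

The cost of containsSlots is one property check per node of the data, independent of the size of the archetype.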

>>
>> I think this is a wise thing to do. Because, what do you want to do 
>> with data?
>> You want to query them, and do this as efficient as possible. You 
>> want database-indexes to be used to find values for ADL-paths (which 
>> are easily translated to object-instance-paths or XPaths)
>> The whole OpenEHR ecosystem is built around ADL-paths: AQL, 
>> templates, etc.
>>
>> Imagine you write a query which retrieves for you a PERSON (as an 
>> object-instance or an XML-instance, or another instanced way), and in 
>> that person are paths, ADL paths.
>>
>> Two difficulties arise:
>> One:
>> Now you write software to analyze that PERSON, and you see the 
>> "contact"-property, and you don't know at that moment if that contact 
>> is included via slots, or is included via a large PERSON-archetype.
>> So in that case, you need to analyze the contents of 
>> archetype-node-id of the contacts to detect if it is an archetype_id 
>> in it or an at-code.
>> This is very hard, and maybe impossible to do this trustworthy. So 
>> the programmer has to check the archetypes to check this.
>
> well to check in the data if you have an archetype id or an at-code, 
> it's just going to be something like:
>
> if (archetype_details != null) {
>     // archetype_node_id contains an archetype id
> }
> else {
>     // archetype_node_id contains an at-code
> }
>
> the Common IM spec 
> <http://www.openehr.org/releases/trunk/architecture/rm/common_im.pdf> 
> says this - see p 22 - invariants:
>
> Archetyped_valid: is_archetype_root xor archetype_details = Void
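
The quoted check and the Archetyped_valid invariant can be written out as a small runnable Java sketch. This is not the reference implementation: the class LocatableSketch, its fields, the use of a plain Object to stand in for the archetype details, and the choice to hold is_archetype_root as a stored flag are all assumptions made for illustration.

```java
// Hypothetical stand-in for a LOCATABLE; names and types are assumptions.
final class LocatableSketch {
    final String archetypeNodeId;  // holds either an at-code or an archetype id
    final Object archetypeDetails; // stand-in for the details object; null means Void
    final boolean archetypeRoot;   // stand-in for is_archetype_root

    LocatableSketch(String archetypeNodeId, Object archetypeDetails, boolean archetypeRoot) {
        this.archetypeNodeId = archetypeNodeId;
        this.archetypeDetails = archetypeDetails;
        this.archetypeRoot = archetypeRoot;
    }

    // Archetyped_valid: is_archetype_root xor archetype_details = Void
    boolean archetypedValid() {
        return archetypeRoot ^ (archetypeDetails == null);
    }

    // The check from the quoted snippet: what does archetype_node_id contain?
    String nodeIdKind() {
        return archetypeDetails != null ? "archetype id" : "at-code";
    }

    public static void main(String[] args) {
        // "contact_archetypeId" is a placeholder value, as in the path examples.
        LocatableSketch root = new LocatableSketch("contact_archetypeId", new Object(), true);
        LocatableSketch inner = new LocatableSketch("at0004", null, false);
        System.out.println(root.nodeIdKind() + " valid=" + root.archetypedValid());
        System.out.println(inner.nodeIdKind() + " valid=" + inner.archetypedValid());
    }
}
```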

This is indeed a way to handle this, but two things bother me in this 
case.
- You cannot have an XPath engine doing this complex querying; it makes 
path-based queries very complex, and maybe even impossible.
- Maybe technically not so important, but the property name does not 
indicate what it contains, and it is bad programming practice to have 
misleading names.

I understand that having an archetype_id property creates redundant 
information, because the information already is in the archetype_details 
property, but the same also goes for storing the archetype_id in the 
archetype_node_id. I think this redundancy is ugly, and should not 
occur.  I think redundancy is a design error. The reason is that the 
archetype_details contain other information besides the archetype_id.

The best way would be a separate archetype_id property, and possibly 
archetype_details without archetype_id, or another way to hold the 
details; these details are also in the archetype itself.
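
To illustrate why a separate property helps a plain XPath engine, here is a minimal Java sketch using the standard javax.xml.xpath API. The XML shape below is hypothetical and is not the openEHR XML schema; in particular the archetype_id attribute is the proposed separate property, not something the current specs define. With it, "which nodes are slot roots" is a single attribute test in the path expression.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class SeparatePropertyXPath {
    // Hypothetical serialization: archetype_id appears only where a slot is filled.
    static final String XML =
        "<person archetype_id='person-archetype'>"
      + "<contacts archetype_node_id='at0003' archetype_id='contact_archetypeId'>"
      + "<items archetype_node_id='at0004'/>"
      + "</contacts>"
      + "<contacts archetype_node_id='at0005'/>"
      + "</person>";

    // Counts slot roots below the top-level element with one XPath expression.
    static double countSlotRoots(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate(
                "count(/person//*[@archetype_id])", doc, XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("slot roots: " + countSlotRoots(XML));
    }
}
```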

>
>>
>> Two:
>> Imagine writing an AQL-engine on a database. As we know, the syntax 
>> for an archetype_id is completely different from the syntax for an 
>> archetype_node_id. But the writer of the engine needs to find these 
>> completely different things in one property, with no indication which 
>> is what, especially in slotted-instance-sets. I think that you can 
>> see how difficult that is, he needs, as in the previous problem, to 
>> check archetypes to know if the contents of that property is an 
>> archetype_id, and interpret/create the ADL-path accordingly.
>
> I'm not sure where the difficulty lies. I don't believe any of the 
> implementations of AQL have had any great difficulties in this area. 
> Whatever path is provided in a query, the AQL engine just looks for 
> it. It can easily do this in quite a dumb way.

I am not sure what it means if there are two different paths possible to 
one data-leaf. One path with the slot defined, and one path as if there 
was no slot.
A few weeks ago we both argued to William Goossens that the path is the 
identifier for a datapoint, not the archetype_node_id.
But now you seem to imply that more than one path definition is 
possible.

By the way, it is getting late (again)

>
> I can imagine that one day in the future we use Snomed-like codes for 
> both at-code and archetype id, which would mean it's the same kind of 
> code always in the property archetype_node_id in a Locatable, but that 
> wouldn't make a lot of difference.

Let's hope this will never happen. How difficult can we get :)

Bert