Polishing node identifier (at-codes) use cases.

Bert Verhees Tue, 24 Sep 2013 01:10:10 +0200

Op 23-9-2013 14:21, Thomas Beale schreef:
> On 23/09/2013 11:47, Bert Verhees wrote:
>> On 09/23/2013 10:38 AM, Thomas Beale wrote:
>>> On 20/09/2013 20:40, Bert Verhees wrote:
>>>> Op 20-9-2013 17:01, Thomas Beale schreef:
>>>>> it's simpler than you think - we made that property mandatory so 
>>>>> that programmers would never get a null exception.
>>>> That must have been a long time ago; nowadays, programmers have no 
>>>> problem handling a null property.
>>>
>>> actually, that's not quite true. It's probably the primary reason 
>>> for exceptions in object-oriented software - method call on a void 
>>> object. But I get what you are saying, and for this String field, 
>>> being null would not pose a great problem. So we could change the 
>>> spec to do that.
>>
>> Yes, it is very easy to catch a null-exception and then do something 
>> with that information. Anyway, IMHO, specs should not solve technical 
>> problems, and they mostly don't do that. I believe this is also 
>> defined in UML.
>>
>> Technical problems are for implementers to solve.
>
> Hi Bert,
>
> I don't happen to believe in that philosophy. Here's why: if you leave 
> too much open, for implementers to constantly decide, then the 1,000 
> people (let's say) who download your specification will solve ...... [ 
> skip, skip, skip ] ......problems. That's 333 x 10 = another 3,330 
> hours gone.


Sorry for skipping this, but I don't think it is relevant to the 
discussion.

There is really something good in the UML philosophy, which says that 
specifications should not interfere with implementation and should be 
kept clean of implementation issues.
Two things in this discussion illustrate that very well.

Specifications written some time ago tried to prevent possible 
implementation errors by interfering in software development.
You are, in effect, demonstrating that this UML philosophy is good. I 
will explain this.

First, we can say that the specs are, for the most part, designed with 
this UML philosophy in mind: no database platform is defined, no 
database structure, no programming language and no platform. Almost 
everything in the OpenEHR specs is left open in a way that honors this 
UML philosophy, which I think is good.

But there are a few exceptions in the OpenEHR design.
One is the small design issue that is justified by this argument:
"we made that property mandatory so that programmers would never get a 
null exception."
The other is using one single property for different things in order to 
avoid errors, as you explain below (I disagree).

And then you bring in some arbitrary calculations as an argument; I have 
already skipped those.
> That's 5,330 hours, or over 2 person years. It clearly makes sense to 
> spend 10, 2 ...... [skip, skip, skip] ...... at ambiguity is the enemy 
> of good software and interoperability, and of efficiency in development.
>
> For that reason I believe specifications should very carefully specify 
> things. I'll give a very simple example. The openEHR specifications 
> routinely specify which properties of a class are mandatory, optional, 
> and which String fields have to be non-empty. Even those simple things 
> help save time.

What time do you save? Allowing developers to write sloppy code because 
they don't need to check for a null-value?
Do you think that professional programmers are not able to apply basic 
programming rules, to check for a null value when retrieving data from a 
database or external source?

I don't know what quality of software development you expected in the 
OpenEHR community when writing this spec, but it seems that you did not 
have much confidence in developers at that time.

>
> Now, the actual openEHR specs of course have some errors, and wrong 
> decisions. The original specs that most people use today (but are 
> about to be revised) probably have some wrong decisions made by me, as 
> a best guess at the time of the best way to limit ambiguity.
>
> So what is really needed is for the communities around each 
> development technology to build up common reference software 
> components that become the one true way (for today) of doing X in 
> Java, or Y in Python. If developers start saying 'X is a strange 
> decision', and upon analysis, there is a better way to do X with no 
> impact on data, quality, performance etc, we should do it. That's how 
> we should progress.
>
> But I don't believe in 'leave it to the programmers' because I don't 
> believe in 'programming', I only believe in 'design', carried out at 
> different levels of granularity.

It is inefficient to have an empty string instead of a null value; it is 
a waste of processor time. Programmers must now check the contents of 
the string, and if it is empty, it must be treated as null.
Checking for a null string (which does not exist in memory) is much more 
efficient: no string operations are needed, no object creation, etc.
It is basic code optimization: never instantiate a variable if you want 
it to be null. Your specs force software to be unnecessarily inefficient.
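
To make the two conventions concrete, here is a minimal Python sketch. 
The function names are hypothetical and None stands in for null/void. It 
only shows what each convention asks the caller to check; it makes no 
claim about measured performance.

```python
from typing import Optional


def has_value_nullable(node_id: Optional[str]) -> bool:
    # Nullable convention: an absent value is simply None (null/void).
    return node_id is not None


def has_value_mandatory(node_id: str) -> bool:
    # Mandatory-string convention: an absent value is stored as "",
    # so the caller has to inspect the string contents instead.
    return node_id != ""
```

With the nullable convention the caller tests for absence directly; with 
the mandatory convention an empty string has to be interpreted as 
"no value".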

You are taking responsibility for errors that bad or inexperienced 
programmers might make.
It shows disdain for most developers. An ivory tower, we call that in 
the Netherlands.

>
>>
>> That is why this is a strange decision.
>>
>>>
>>>>
>>>> I wonder what the idea behind stuffing the archetype_id in the 
>>>> archetype_node_id property is?
>>>> Here you make it harder for programmers, because the archetype_id 
>>>> has a different syntax in archetype paths than the 
>>>> archetype_node_id has, and in any case serves many other 
>>>> functions, so a programmer has to check the string layout to find 
>>>> out whether it is an archetype_id or an archetype_node_id. It also 
>>>> blocks the possibility of storing the "at"-code for the root and 
>>>> checking the ontology for its contents.
>>>
>>> the idea is that there is only one field to look at to find 
>>> archetype identifying information in data. It is either an 
>>> archetype_id (string form) or an at-code, or (for systems that 
>>> support it) it's empty / 'unknown' (which could be replaced by 
>>> null/void). With the archetype id, you can always look up the 
>>> archetype and find out the root code (at0000, or a matching pattern 
>>> like at0000.1 or at0000.1.1). But if you can't look up the 
>>> archetype, you are lost, and that's what the archetype_id is for.
>>
>> The point is, the archetype_id is stored in the property 
>> archetype_node_id. Pablo implemented it like that in XML, because he 
>> found in the specs that it should be that way. I think this is an 
>> unneeded complication of the specs. It would have been better to 
>> assign a separate property for the archetype_id, alongside the 
>> archetype_node_id.
>
> Well we thought about that a long time ago, and the view was that then 
> you will have two fields in every LOCATABLE, one of which (hopefully) 
> is null/void in each actual instance. This could easily lead to 
> errors, and wastes a data property.

I don't see any errors arising from having different properties for 
different things.
I do see errors arising from having different things in the same property.
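
As a rough illustration of the "different properties for different 
things" alternative, here is a minimal Python sketch. The class below is 
hypothetical and is not the published openEHR LOCATABLE; the field names 
are simply borrowed from this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Locatable:
    # Hypothetical two-field variant, not the published openEHR model.
    archetype_node_id: str              # always an at-code, e.g. "at0000"
    archetype_id: Optional[str] = None  # set only at archetype root points

    def is_archetype_root(self) -> bool:
        # No string parsing needed: the root marker is its own field.
        return self.archetype_id is not None
```

In this variant the at-code of the root stays available, and whether a 
node is an archetype root is a null check rather than a syntax check.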

A waste of a data-property?

I do not understand what you are trying to say.  Do you mean that there 
are occasions in which a specific property is useless?
Because it is not used? Then I must say that OpenEHR has a lot of waste, 
because there are many properties which are not used all the time.
:)

Why is that a waste? Because of database-space?

Maybe it is this: it must be because you don't want null values and want 
to put empty strings in their place.
That is indeed a waste; as I explained above, it is a waste of memory, 
processor time and database usage.
So in that part of the design, you justify a waste.

Maybe it is time to give some responsibility for software development to 
software developers and to reconsider decisions such as
- using one property for two different things
- using empty strings to indicate a null value

This is the big-data society, in which programmers are educated in their 
profession. You should trust them more than you do now.
As you say, you thought about this a long time ago. That was also my 
impression, and it would be good to change this.

>
>>
>> He found this spec in common.pdf, section 3.1.2, where it is stated:
>> "The archetype_node_id is the standardised semantic code for a node 
>> and comes
>> from the corresponding node in the archetype used to create the data. 
>> The only exception is at archetype
>> root points in data, where archetype_node_id carries the archetype 
>> identifier in string form rather
>> than an interior node id from an archetype."
>>
>> This makes it difficult to implement, because an implementer has to 
>> test whether the archetype_node_id contains an at-code or an 
>> archetype_id. This can lead to ambiguities, for example if the XML 
>> contains archetype slots and the connected instances are embedded, 
>> which is legal and can really speed up XPath queries. These 
>> ambiguities are possible especially because it is not strictly 
>> defined what an at-code looks like.
>
> We certainly need to make sure that the pathing in the XML expression 
> of the specifications works as it should. I'm not sure if I understand 
> your last statement though.

Imagine an archetype slot, for example for holding contacts in a PERSON.
There are two ways of implementing it in object instances or XML instances.
One way is:
Having separate instances, connected via a link that is not defined in 
the specs, indicating that one instance should be placed inside the 
property of another instance.
Talking about errors, here is a situation in which the specs fail to 
indicate how the connection must be made, and it is left to implementers.

Seeing that the specs fail to specify this (and the specs want to protect 
us against simple programming errors), we must conclude that the specs 
really want us to implement archetype-slotted instances as a 
materialized part of the containing instance.

I think this is a wise thing to do. Because what do you want to do with 
data?
You want to query it, and to do so as efficiently as possible. You want 
database indexes to be used to find values for ADL paths (which are 
easily translated to object-instance paths or XPaths).
The whole OpenEHR ecosystem is built around ADL paths: AQL, templates, etc.

Imagine you write a query which retrieves a PERSON for you (as an 
object instance, an XML instance, or some other instantiated form), and 
that person contains paths, ADL paths.

Two difficulties arise:
One:
Now you write software to analyze that PERSON, and you see the 
"contact" property, and you don't know at that moment whether that 
contact is included via slots or via one large PERSON archetype.
So in that case, you need to analyze the contents of the 
archetype_node_id of the contacts to detect whether it holds an 
archetype_id or an at-code.
This is very hard, and it may be impossible to do reliably. So the 
programmer has to consult the archetypes to check this.
This is a big, unnecessary waste. A lot of processor time and thousands 
of lines of code are involved in reading the archetype and checking 
whether a string in the "contact" property is an at-code or an 
archetype_id.
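
The string check described above could be sketched in Python as follows, 
under two loudly stated assumptions: that at-codes always look like 
at0000, at0000.1 or at0000.1.1 (the forms mentioned earlier in this 
thread), and that archetype ids have the three-part shape 
originator-model-class.concept.vN. Neither pattern is copied from the 
spec text, and the function name is hypothetical.

```python
import re
from typing import Optional

# Assumed at-code shape: "at" + digits, optional dotted suffixes.
AT_CODE = re.compile(r"at\d+(\.\d+)*")
# Assumed archetype id shape, e.g. openEHR-DEMOGRAPHIC-PERSON.person.v1
ARCHETYPE_ID = re.compile(
    r"[A-Za-z]\w*-[A-Za-z]\w*-[A-Za-z]\w*\.[A-Za-z][\w-]*\.v\d+"
)


def classify_node_id(value: Optional[str]) -> str:
    """Guess what an archetype_node_id holds, from its syntax alone."""
    if value is None or value == "":
        return "empty"
    if AT_CODE.fullmatch(value):
        return "at_code"
    if ARCHETYPE_ID.fullmatch(value):
        return "archetype_id"
    # Neither assumed pattern matched: the syntax gives no answer.
    return "unknown"
```

Any value that falls into the "unknown" branch is exactly the case where 
the programmer would have to go back to the archetypes themselves.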

Two:
Imagine writing an AQL engine on top of a database. As we know, the 
syntax for an archetype_id is completely different from the syntax for 
an archetype_node_id. But the writer of the engine needs to find these 
completely different things in one property, with no indication of 
which is which, especially in slotted instance sets. I think you can see 
how difficult that is: as in the previous problem, he needs to check the 
archetypes to know whether the contents of that property is an 
archetype_id, and interpret/create the ADL path accordingly.
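
As a minimal sketch of the kind of translation an engine writer ends up 
doing, the Python function below turns an ADL-style path into an XPath. 
It assumes that each path segment is a property name with an optional 
[...] predicate and that archetype_node_id is exposed as an XML 
attribute; the function name and the example paths are hypothetical.

```python
import re

# One path segment: a property name plus an optional [predicate].
SEGMENT = re.compile(r"([A-Za-z_]\w*)(?:\[([^\]]+)\])?")


def adl_path_to_xpath(adl_path: str) -> str:
    steps = []
    for segment in adl_path.strip("/").split("/"):
        match = SEGMENT.fullmatch(segment)
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        name, predicate = match.groups()
        if predicate is None:
            steps.append(name)
        else:
            # At-codes and archetype ids both end up in the same
            # predicate, on the same attribute.
            steps.append(f"{name}[@archetype_node_id='{predicate}']")
    return "/" + "/".join(steps)
```

For example, adl_path_to_xpath("/contacts[at0002]") and 
adl_path_to_xpath("/contacts[openEHR-DEMOGRAPHIC-ADDRESS.address.v1]") 
both produce a predicate on archetype_node_id, so nothing in the 
resulting XPath says which kind of identifier is being matched.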

This is not a far-fetched example; we all need to create AQL engines in 
order to use the OpenEHR ecosystem as meant in the specs. It is not very 
hard to do, because ADL is very similar to XPath, and I think that 
object databases also have object-path queries. So it is easy to 
translate, but we still need to do it, and to create/interpret ADL paths.

The situation you have created in order, as you state, to avoid errors 
is itself causing errors or unnecessary difficulties, and it requires 
thousands of lines of code (wasted processor time).

I hope you agree that this is an error, and I hope that you will see to 
it that these two things (both described in this email) are changed in 
the specs.
Thanks for your attention
Bert