Re: [cellml-discussion] Proposal: BCP for including external code in CellML models

Andrew Miller Sun, 18 Mar 2007 15:17:37 -0800

Matt wrote:
> I have often thought referencing external code through a clearly
> defined interface would be useful, and mostly because procedural code
> is another natural way to solve problems. But I have always banged my
> head up against validation. With procedural code this amounts to
> passing tests - good tests - and being confident that the code will
> break in useful ways when it does break. I don't see this as being any
> different to the intended outcome of valid CellML models that are
> purely declarative.
>
> At first glance it might seem that it is more taxing for a developer
> wanting to use CellML in their application if they need to handle
> external code; but this proposal for external code is very specific to
> the math declarations, and I think independent of whether the math is
> represented in MathML or as an external source of procedural code, the
> decisions of an application that are investigating the math are going
> to be difficult without sufficient annotation that tries to classify
> the math formulations in a way that a machine can filter what it is
> capable of and not capable of processing.
I deliberately don't address how to let the tools associate certain 
external code with a given code-identifier URI. Initially, this would 
have to be tool specific, but there could be another specification for 
this process in the future.
>  In some cases I imagine the
> application developer would welcome a particular math problem being
> already coded in a language that could be compiled and run. If that
> thought is continued, then there is a place for a model representation
> that has all math represented by external code, with the model
> structure being represented in CellML. This would obviously be under
> the assumption that some particular decisions for simulation of the
> model had been made; it is indeed a different scenario from the pure
> declarative model that seeks to explain the mathematical problem at a
> higher level and leave it to applications to resolve the simulation
> from this.
>
> At the moment we don't actually have a useful way for providing a
> cellml model with enough machine readable information for someone to
> rerun our model in exactly the same way as we had.
It is getting closer. PCEnv has its own non-standard meta-data for the 
exact algorithm used, but we haven't been able to agree on a more 
general standard way to represent algorithms with graceful fallback yet.
>  By referencing
> and/or including external code, we allow the step of exchanging a
> model at the simulation level, which is actually not a bad thing if
> our goal is to promote collaboration of model building.
>   
I don't know if this is true, as I am not suggesting that we allow the 
stepping (integrator) algorithm itself to be exchanged (although that 
could be a different, future specification if needed).
> I do think there is a possibility that people would abuse this; i.e.
> jump straight to binding bits of code here and there together with
> CellML; but if we maintain standards and best practice, then it should
> be easy to show them up. Also, perhaps we should trust people to
> evolve to only resorting to external code if it absolutely is the best
> way to solve their problem.
>
> There are a couple of things that we could possibly lose by bringing
> in external code:
> 1) producing human readable equations for publication that accurately
> reflect the mathematics in the model. Annotation of the algorithms or
> maths in the external code would help, but would not guarantee that
> the publication reflected exactly what was encoded in the model.
>   
It is better that we have some equations in MathML than no equations at 
all. In the cases which my BCP document is targeting, you probably would 
want to represent the model in the paper as a mixture of equations (from 
the MathML) and pseudo-code (which would probably be hand-written). The 
CellML and machine-readable code would ideally be referred to in a 
repository, or provided as a supplement.
> 2) ease of creating machine readable annotation for parts of the
> external code that would require it - for example under MIRIAM to bind
> each 'component' of a model to the relevant part of a reaction
> network. This is where you would be questioning the modeler as to
> whether their external code should be broken down and spread across
> models. But they may not have control over the external source, or,
> perhaps they are exchanging models that have necessarily lumped a lot
> of biological concepts into one piece of external code (library)
> because it's more efficient to solve that way; we now have a non
> MIRIAM compliant model.
>
> I would like to think including linking to external code in the CellML
> specification would push us to make a bigger effort on the procedures
> for model validation, and get more encouraging involvement of various
> modelers sitting out there with code that works; rather than thinking
> we will somewhere lose some high level elegance of CellML.
>   
I agree. Validation (i.e. testing) procedures for a specific class of 
CellML models containing external code (those containing machine 
learning techniques) are part of my PhD project.
> Specific comments (quoted pieces from
> http://www.cellml.org/Members/miller/bcp-external-models/ are enclosed
> in triple quotes)
>
> """[CellML] models are very good at describing complete mathematical
> models in a format which can be exchanged between model authors and
> users. This adds significant value to a model representation, because
> third parties can take the model, and use it in their preferred
> software packages to reproduce any results the author published."""
>
> Need some clear examples of model types that cannot be expressed in
> CellML, i.e. some algorithms that are best (or only) expressible at
> the moment in procedural code. I know that various neural network
> models and genetic algorithm based learning systems have evolved
> mainly from procedural thought. I think we need to really consider
> that some problems would be much better understood by model authors if
> they are expressed in procedural code.
>   
I'm not sure that we need this in the document, because the document is 
intended to provide best current practice guidelines. We could make a 
tutorial document (perhaps when the tools are better developed) 
describing examples of external code and how to make them work in the tools.


Examples:
1) Most machine learning techniques.
2) Stochastic sub-models, where there might not be a closed-form 
mathematical equation to go from the behaviour of individual parts of 
the model to the concentrations of each state.
3) Certain mathematical functions (e.g. those arising from various 
combinatorial problems) do not have a closed form, and will require 
external code to perform a numerical solve.
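
To make example 2 a little more concrete, the following is a rough Python sketch of the kind of stochastic sub-model that has no convenient closed-form equation and would have to live in external code. It is not taken from any real model: the function name, the birth-death process, and the rate constants are all assumptions made up for illustration.

```python
import random


def simulate_birth_death(n0, birth_rate, death_rate, t_end, seed=0):
    """Gillespie-style simulation of a hypothetical birth-death process.

    Returns a list of (time, population) pairs, starting at (0.0, n0).
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while True:
        birth = birth_rate       # constant production propensity
        death = death_rate * n   # degradation propensity, proportional to n
        total = birth + death
        if total <= 0.0:
            break                # nothing can happen any more
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            break
        if rng.random() * total < birth:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


trajectory = simulate_birth_death(n0=10, birth_rate=5.0, death_rate=0.5,
                                  t_end=20.0, seed=1)
```

The rest of the model (for example, anything that consumes the population as a concentration) could still be expressed in MathML; only the event-by-event simulation needs procedural code.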

> """Having part of a model expressed in CellML, and other parts
> expressed in some more generic language is still useful, because it
> means that the common part of the model can be re-used more easily,
> either by providing external code of a different kind, or, where
> possible by replacing the external code with MathML."""
>
> If external code can be replaced with MathML, then why wouldn't this
> have been in a CellML component in the first place?
>   
Because the original model might need to use external code, but someone 
has proposed a new model which no longer needs it.
> I see a pro and con where someone encodes most of a model in an
> external code block bound into a single component of a model. The pro
> would be that maybe this has helped promote someone actually bothering
> to use cellml - as a first step, they simply wrapped their existing
> code; in this case it would be up to repository maintainers to
> encourage a breakdown of the model. The con of course is that we lose
> model structure into the external code, and there is no way we can
> automatically extract that. It is therefore effectively hidden until
> broken down - if that ever makes sense for the model.
>   
My guidelines try to discourage hiding model structure in external code 
when it is possible to do otherwise.

I agree that providing the tools to include external code would allow an 
incremental migration of models from procedural to CellML code. If this 
process is convenient for people, I don't see a problem with having 
transitional (non-published) models. I think that the current document 
encourages anyone doing this to complete the process to the maximum 
extent possible.
> """It is also hoped that this specification will encourage model
> developers to build up libraries of CellML accessible external code,
> which can be re-used in a range of CellML models, therefore increasing
> the range of modelling techniques available to CellML model
> authors."""
>
> I would see an open library of external code being very useful. There
> would need to be clear grading of that code, for example validating
> that code even compiles (if it needs to) and runs on x, y, z platforms.
>   
There are lots of libraries like that out there already. Do you mean 
something CellML specific? If so, I think that would be useful, but it 
is a bit early now, because tools need to develop interfaces to external 
code first, and start to standardise that a bit more. I don't think that 
is a pre-requisite for agreeing on an external code document.
> """Best practice guidelines for CellML document authors"""
>
> """1.   External code should be used only where a part of a model cannot
> be adequately expressed in CellML. External code is often
> non-portable, and using it reduces the re-usability of your model, and
> so it should only be used when needed."""
>
> yes
>
> """2. External code should only perform the calculations that CellML
> is unable to perform, with the rest of the calculations expressed as
> MathML, in the CellML model. This is important, because it increases
> the fraction of your model that can be more easily re-used by other
> modellers. It also means that CellML editing and visualisation
> software will allow your model to be edited and visualised better."""
>
> yes and no. I don't think representing in MathML offers any more ease
> for re-use unless you are all sharing a prescribed subset of MathML
> and agree on the acceptable forms of equations if algebraic
> manipulation is limited or not possible.
>   
We do have a CellML-subset of MathML (which functions you can call). I 
agree that the declarative use of MathML is somewhat limited by the fact 
that most (all?) CellML tools can't do any symbolic algebra. However, at 
the very least it becomes possible to perform Newton-Raphson solves (you 
could potentially do this with procedural code too, of course, but you 
would at least want the code to be broken up into minimal functions so 
you don't have to do a multivariate solve, as requested in 3). However, 
just re-ordering equations will allow us some potential for re-use, 
especially for simple equations like x = y, or linear combinations of 
other variables (which it would be quite reasonable for CellML software 
to be able to manipulate, should there be sufficient demand for this).
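
As a sketch of the Newton-Raphson point, assume a small external function that computes y1 from x1 and x2, as in guideline 3. If a tool is instead given y1 and x2 and has to find x1, a one-dimensional solve is enough, provided the external code has been kept to a single minimal function. Everything here (the function body, the names, and the finite-difference derivative) is an assumption chosen for illustration, not the behaviour of any existing CellML tool.

```python
def external_y1(x1, x2):
    """Hypothetical external code: computes y1 from x1 and x2."""
    return x1 ** 3 + x2 * x1


def newton_solve(f, x0, tol=1e-10, max_iter=50, h=1e-6):
    """Scalar Newton-Raphson using a central finite-difference derivative."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        dfx = (f(x + h) - f(x - h)) / (2.0 * h)
        if dfx == 0.0:
            raise ArithmeticError("derivative vanished; try another start")
        x -= fx / dfx
    raise RuntimeError("Newton-Raphson did not converge")


# Known: y1 = 10 and x2 = 1. Unknown: x1, so solve external_y1(x1, 1) - 10 = 0.
x1 = newton_solve(lambda x: external_y1(x, 1.0) - 10.0, x0=1.0)
```

Had y1 and y2 been computed together by one larger function, the same re-arrangement would need a multivariate solve over both outputs at once.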

> """3. Modellers should, where feasible, separate external code into as
> many different sub-functions as possible. For example, if you have
> external code to compute y1 from x1 and x2, and y2 from x1 and x2, you
> should write this as two separate external function applications,
> unless there is a compelling reason to do otherwise (such as is the
> case if it is much more efficient to compute them together). Doing
> this makes it easier to modify the CellML model in the future, and
> allows the CellML processing software to determine the order in which
> expressions are evaluated, making your model more flexible."""
>
> see above ... the compromise will always be the amount of information
> you can extract out of the model for other purposes - for example for
> model reuse, for simply visualizing and understanding the makeup of
> the model, for publication. It could be compelling enough for people
> to produce at least one highly broken down model along with the one
> fitted for optimization.
>
> """4. External code should, by itself, meet  [MIRIAM] requirements 1
> and 2. This means that the external code should be encoded in a
> public, machine-readable format, and it should be valid and
> compilable."""
>
> It should meet all the criteria of MIRIAM compliance as part of being
> a model on the whole.
Together with guideline 5, that is still implicitly required. I added 
this guideline in because we don't want to encourage people to publish 
their CellML XML, but not include the external code in an adequate 
format, and then claim that they have shared their model.

The external code by itself isn't a model, so can't comply with the rest 
of the MIRIAM requirements. If people are publishing the model, the 
model as as a whole should be MIRIAM compliant, but that is a general 
concern, and is not specific to this document (hence it wouldn't make 
sense to give it as a guideline). I have tried to choose the 
best-practice guidelines so that modellers who follow them, as well as 
the CellML specification, will create a maximally useful, MIRIAM 
compliant model. I think that the goal as it is now is useful for this.

>  The test cases are going to be very important I
> think in assuring the quality of external code.
I'm not sure that testing the external code in isolation would always be 
useful, simply because there may not be any data about what outputs the 
external code should produce for a given input (i.e. the external code 
can only be tested by how well it works in the context of the entire model).
>  You might make the
> case the external code is wrapped in its own model which itself would
> need to be fully MIRIAM compliant. The MIRIAM document is a bit weak
> around the edges of things like validation and the annotation of
> 'components' of a model.
Hence we need this guideline, which essentially requests that 
external code be written in a real programming language, and be valid 
under the rules of that programming language.
>  I think we need to be clear about what
> validation is necessary for models that reference external code.
>   
This is no different from the standard rules for validation, although we 
do of course need to be careful in the machine learning case that we are 
learning the biology and not just the data we are testing with. However, 
this is a concern for another document, and could still apply to pure 
CellML models.
> I would still like more clarification of how important MIRIAM is to
> this; especially in that I think the requirements of MIRIAM haven't
> really been designed with typical procedural code examples in mind. I
> don't think MIRIAM can't cope with it.
>
> """5. The external code should be treated as part of the model. When a
> model represented in CellML is published, the external code should be
> published alongside it, unless it is part of a generally available
> library of external code."""
>
> The latter part worries me a little. Enter license bewilderment. But see 6.
>
> """6. The definitionURL used on csymbol elements should be a URL under
> the control of the author. It is not necessary for there to actually
> be a document accessible at the URL, as it is merely intended as a
> unique identifier."""
>
> What happens with multiple authors? Will an author always guarantee a
> method for creating a URL? I think this problem is related to 5. For
> example, if the source code for an external component is submitted to
> a repository and becomes licensed according to that, then the URL
> should probably be related to that. So I think ultimately the domain
> that wants to guarantee that the source is perpetually available
> should be the domain that forms the base of the URL.
>   
I meant author of the procedural code, not author of the CellML model. 
If the author of the procedural code doesn't provide a URL, the model 
author may need to make one under their control. I guess it would be 
worth clarifying that more.

If the code has been written by one author, and then subsequently taken 
and modified by another, it should be the second author who specifies 
the URL, because we want to identify the modified version, not the original.

The system is intended to simply provide a unique URI for the code. 
There is not supposed to be any implication that the code can be fetched 
from any location. The reason for requiring the domain be under the 
control of the 'author' is simply to avoid collisions.

Perhaps:
"The definitionURL used on csymbol elements should be a URL which 
uniquely identifies the external code being used. It is not necessary 
for there to actually be a document accessible at the URL, as it is 
merely intended as a unique identifier. If no existing URL has been 
assigned for the particular version of the external code being 
referenced, a new URL may be assigned. To avoid collisions, the person 
or entity assigning a URL should choose a URL under their control."

That way, if a repository allocates URLs for external code, model 
authors could use that URL, which would help to ensure that commonly 
used external code has a single well-known URL.
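
A minimal sketch of how a tool might treat the URL purely as an identifier, using only the Python standard library. The MathML fragment, the example.org URL, the function f1_v2, and the registry are all hypothetical; how real tools associate code with a URI is deliberately left tool specific. Note that nothing is ever fetched from the URL.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical fragment: y1 = f(x1, x2), where f is external code.
FRAGMENT = """
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply><eq/>
    <ci>y1</ci>
    <apply>
      <csymbol definitionURL="http://example.org/external-code/f1/v2"/>
      <ci>x1</ci>
      <ci>x2</ci>
    </apply>
  </apply>
</math>
"""


def f1_v2(x1, x2):
    """Stand-in for the external code identified by the URL above."""
    return x1 + 2.0 * x2


# Tool-specific association of identifier URLs with local code.
REGISTRY = {"http://example.org/external-code/f1/v2": f1_v2}


def external_code_urls(mathml_text):
    """Return every definitionURL found on a csymbol element."""
    root = ET.fromstring(mathml_text.strip())
    tag = "{%s}csymbol" % MATHML_NS
    return [el.get("definitionURL") for el in root.iter(tag)
            if el.get("definitionURL")]


def resolve(url):
    """Look the identifier up locally; the URL itself is never dereferenced."""
    try:
        return REGISTRY[url]
    except KeyError:
        raise LookupError("no external code registered for " + url) from None


urls = external_code_urls(FRAGMENT)
result = resolve(urls[0])(1.0, 2.0)
```

Because the lookup key is the whole URL, a modified version of the code published under a different URL cannot be confused with the original.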
> cheers
> Matt
>
>   

