Hi,

I am developing a CellML model (using external code) of transcriptional
control in yeast; the model file is 23 MB in size. I hope to eventually
do a similar thing for organisms which have much more complicated sets
of interactions, in which case this size may grow substantially.

If anyone on this list is interested in similar problems (I presume 
similar issues come up in a range of systems biology problems, whether 
you are working with CellML or SBML), I would welcome your feedback and 
suggestions, and perhaps we could collaborate.

A model of this size creates some unique issues for CellML processing tools:
1) Just parsing the CellML model (especially with a DOM-type parser 
which stores all the nodes into a tree, but probably with any type of 
parser) is very slow.
2) The CellML model might not all fit in memory at the same time, 
especially if the model gets to be multi-gigabyte. It might be possible 
to make use of swap to deal with this, but if the algorithms don't have 
explicit control over when things are swapped in and out, it will be 
hard to work with such a model.
3) The CellML model is much larger than it needs to be, which makes it 
inconvenient to exchange with third parties.
4) The current CellML API implementation has been designed for maximum 
flexibility ('one size fits all'), but this flexibility (e.g. supporting 
live iterators, access to arbitrary extension elements, and so on) is 
expensive for very large models. Much of this expensive functionality is 
probably unnecessary for most tools, although what is and is not 
necessary depends on the tool being used.
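
As a rough illustration of issues 1 and 2, here is a minimal Python
sketch (my own assumption, not part of the CellML API or any existing
tool) that streams through a model with the standard library's
iterparse and counts a few element types, discarding each top-level
element once it has been processed so that the whole tree is never held
in memory. The function name count_elements and the tiny sample model
are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for a real (much larger) CellML file; hypothetical content.
SAMPLE_MODEL = b"""<?xml version="1.0"?>
<model xmlns="http://www.cellml.org/cellml/1.0#" name="m">
  <component name="a"><variable name="x"/><variable name="y"/></component>
  <component name="b"><variable name="z"/></component>
</model>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_elements(source, wanted=("component", "variable", "connection")):
    """Count selected elements while streaming through an XML document."""
    counts = Counter()
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    depth = 1                     # we are now inside the root element
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        name = local_name(elem.tag)
        if name in wanted:
            counts[name] += 1
        if depth == 1:
            # A direct child of the root has just ended: drop it (and
            # everything below it) so memory use stays roughly constant.
            root.clear()
    return counts


counts = count_elements(io.BytesIO(SAMPLE_MODEL))
print(dict(counts))
```

The same function accepts a file name instead of a file object, so it
could be pointed at a large model on disk. It only shows that streaming
is possible; it does no validation and keeps no model structure.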

In practice, nearly all existing CellML-specific tools handle the file
badly. For example, PCEnv runs out of memory if you try to load the 
file, while Jonathan Cooper's CellML validator just sits at 100% of a 
single CPU for a long time (at least 15 minutes on my system, and still 
running at the time of writing, but the time will obviously depend on 
system speed).

There are some possible ways to improve on this:
A) There are ways to generate ASN.1 schemata from the XML schemata. This 
could be used to produce an efficient (in terms of both data size and 
parse time) binary representation of CellML, with the possibility to 
convert back to CellML. Examples of such binary encodings include Fast
Infoset and the similar (but not ASN.1-based) BiM.
B) A database-based representation of a large CellML model could be 
used, either through an XML-enabled database, or more likely, some 
mapping layer. This would allow the model to be loaded into the database 
once, and the relevant parts retrieved from the on-disk database as 
required, in an algorithmically sensible way. It is worth noting that my 
model is generated using data from a relational database (a process 
which takes up to a minute), but I would like the next step of my 
pipeline to generalise to other CellML inputs.
C) Another, leaner, read-only CellML API could be provided (perhaps based
on the same IDLs, but with certain functionality, like the ability to
modify the model or set mutation event listeners, unavailable). We could
add a SAX-style event dispatcher instead, to allow users to save any
information they want from extension elements, which would otherwise not
be kept in the model. Comments, white-space, and so on would all be
stripped, unlike in the current CellML API implementation. Tools which
are currently using the full CellML API but only require read-only 
access (e.g. the CCGS) might be able to just 'flick the switch' and 
benefit from the leaner API.
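
To make options B and C a little more concrete, here is a minimal
Python sketch combining the two ideas: a SAX-style handler reads the
model once and writes component and variable names into an on-disk (or
in-memory) SQLite database, from which parts can later be retrieved by
query. The table layout, the class VariableLoader and the function
load_model are all hypothetical choices of mine, not an existing CellML
API or mapping layer, and only a small part of CellML is covered.

```python
import io
import sqlite3
import xml.sax

# Tiny stand-in for a real (much larger) CellML file; hypothetical content.
SAMPLE_MODEL = b"""<?xml version="1.0"?>
<model xmlns="http://www.cellml.org/cellml/1.0#" name="m">
  <component name="a">
    <variable name="x" units="dimensionless"/>
    <variable name="y" units="dimensionless"/>
  </component>
  <component name="b"><variable name="z" units="second"/></component>
</model>
"""


class VariableLoader(xml.sax.ContentHandler):
    """SAX handler that copies component/variable names into SQLite.

    Only the name of the current component is kept in Python memory;
    everything else goes straight to the database.
    """

    def __init__(self, db):
        super().__init__()
        self.db = db
        self.component = None

    def startElement(self, name, attrs):
        local = name.rpartition(":")[2]  # drop any namespace prefix
        if local == "component":
            self.component = attrs.get("name")
        elif local == "variable" and self.component is not None:
            self.db.execute(
                "INSERT INTO variable (component, name, units) VALUES (?, ?, ?)",
                (self.component, attrs.get("name"), attrs.get("units")),
            )

    def endElement(self, name):
        if name.rpartition(":")[2] == "component":
            self.component = None


def load_model(source, db_path=":memory:"):
    """Load a model (file name or file object) into SQLite, once."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS variable (component TEXT, name TEXT, units TEXT)"
    )
    db.execute(
        "CREATE INDEX IF NOT EXISTS variable_by_component ON variable (component)"
    )
    xml.sax.parse(source, VariableLoader(db))
    db.commit()
    return db


db = load_model(io.BytesIO(SAMPLE_MODEL))
rows = db.execute(
    "SELECT name, units FROM variable WHERE component = ? ORDER BY name", ("a",)
).fetchall()
print(rows)
```

Passing a file path as db_path instead of ":memory:" would keep the
data on disk, so a tool could load the model once and then fetch only
the components it needs. Connections, mathematics and extension
elements are ignored here and would need their own tables or handlers.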

I would welcome any opinions, comments, suggestions, or collaborations 
on this.

Best regards,
Andrew

_______________________________________________
cellml-discussion mailing list
[email protected]
http://www.cellml.org/mailman/listinfo/cellml-discussion
