In message <[EMAIL PROTECTED]> CellML Discussion List 
<cellml-discussion@cellml.org> writes:
> Hi,
> 
> I am working on developing a CellML model (using external code) of 
> transcriptional control in yeast which is 23 MB in size. I hope to 
> eventually do a similar thing for organisms which have much more 
> complicated sets of interactions, in which case this size may grow 
> substantially.
> 
> If anyone on this list is interested in similar problems (I presume 
> similar issues come up in a range of systems biology problems, whether 
> you are working with CellML or SBML), I would welcome your feedback and 
> suggestions, and perhaps we could collaborate.
> 
> This creates some unique issues for CellML processing tools:
> 1) Just parsing the CellML model (especially with a DOM-type parser 
> which stores all the nodes into a tree, but probably with any type of 
> parser) is very slow.
> 2) The CellML model might not all fit in memory at the same time, 
> especially if the model gets to be multi-gigabyte. It might be possible 
> to make use of swap to deal with this, but if the algorithms don't have 
> explicit control over when things are swapped in and out, it will be 
> hard to work with such a model.
> 3) The CellML model is much larger than it needs to be, which makes it 
> inconvenient to exchange with third parties.
> 4) The current CellML API implementation has been designed for maximum 
> flexibility ('one size fits all'), but this flexibility (e.g. supporting 
> live iterators, access to arbitrary extension elements, and so on) is 
> expensive for very large models. Much of this expensive functionality is 
> probably unnecessary for most tools, although what is and is not 
> necessary depends on the tool being used.

This might be an area in which lazy evaluation and 
parametrically polymorphic programming offer some advantages. In 
particular, with lazy evaluation, data structures are constructed and 
arguments evaluated only when needed. In fact, demand-driven programming allows 
finite results to be obtained from infinite data structures. This might 
remove the need to hold the entire model in memory at one time. In addition, 
in many cases intermediate data structures, though implied in the program, may 
never actually have to exist.
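
To make the demand-driven idea concrete, here is a minimal sketch. It uses 
Python generators as a stand-in for lazy lists, purely for illustration; the 
names component_names and with_even_index are hypothetical and have nothing 
to do with any CellML API. The source stream is infinite, yet only the 
elements actually requested are ever produced, and no intermediate list is 
built between the two stages.

```python
from itertools import count, islice

def component_names():
    """Infinite, lazily produced stream of hypothetical component names."""
    for i in count():
        yield f"component_{i}"

def with_even_index(names):
    """Lazily keep every second name; no intermediate list is built."""
    for i, name in enumerate(names):
        if i % 2 == 0:
            yield name

# Only as many elements as needed for three results are ever generated.
first_three = list(islice(with_even_index(component_names()), 3))
print(first_three)
```

In a lazy language such as Haskell the same behaviour is the default rather 
than something that has to be requested with generators.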

 
> In practice, nearly all existing CellML-specific tools handle the file 
> badly. For example, PCEnv runs out of memory if you try to load the 
> file, while Jonathan Cooper's CellML validator just sits at 100% of a 
> single CPU for a long time (at least 15 minutes on my system, and still 
> running at the time of writing, but the time will obviously depend on 
> system speed).
> 
> There are some possible ways to improve on this:
> A) There are ways to generate ASN.1 schemata from the XML schemata. This 
> could be used to produce an efficient (in terms of both data size and 
> parse time) binary representation of CellML, with the possibility to 
> convert back to CellML. Examples include Fast Infoset and the similar (but 
> non-ASN.1-based) BiM.
> B) A database-based representation of a large CellML model could be 
> used, either through an XML-enabled database, or more likely, some 
> mapping layer. This would allow the model to be loaded into the database 
> once, and the relevant parts retrieved from the on-disk database as 
> required, in an algorithmically sensible way. It is worth noting that my 
> model is generated using data from a relational database (a process 
> which takes up to a minute), but I would like the next step of my 
> pipeline to generalise to other CellML inputs.
> C) Another, leaner, read-only CellML API (perhaps based off the same 
> IDLs, but with certain functionality, like the ability to modify the 
> model, or set mutation event listeners, unavailable). We could add a 
> SAX-style event dispatcher instead, to allow users to save any 
> information they do want from extension elements, which will also not be 
> kept in the model. Comments, white-space, and so on would all be 
> stripped unlike in the current CellML API implementation. Tools which 
> are currently using the full CellML API but only require read-only 
> access (e.g. the CCGS) might be able to just 'flick the switch' and 
> benefit from the leaner API.
> 
> I would welcome any opinions, comments, suggestions, or collaborations 
> on this.

Perhaps this is where a Haskell API might be very useful.
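
The read-only, event-driven style of access in option C does not depend on 
any one language. As a rough sketch in Python rather than Haskell, the 
standard library's xml.etree.ElementTree.iterparse can hand over one element 
at a time instead of building the whole tree first. The function name 
iter_component_names and the toy model are hypothetical, and the sketch 
assumes the model uses the CellML 1.0 namespace.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the model is in the CellML 1.0 namespace.
CELLML_NS = "http://www.cellml.org/cellml/1.0#"

def iter_component_names(source):
    """Yield the name of each CellML component, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{CELLML_NS}}}component":
            yield elem.get("name")
            # Discard the component's children, attributes and text once used.
            elem.clear()

# Toy stand-in for a large model file (hypothetical content).
SAMPLE = b"""<model xmlns="http://www.cellml.org/cellml/1.0#" name="toy">
  <component name="gene_a"><variable name="x" units="dimensionless"/></component>
  <component name="gene_b"/>
</model>"""

names = list(iter_component_names(io.BytesIO(SAMPLE)))
print(names)
```

Note that elem.clear() empties each component but leaves the emptied element 
attached to the root, so a multi-gigabyte model would need a little more care 
than this sketch shows.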

> Best regards,
> Andrew
> 
> _______________________________________________
> cellml-discussion mailing list
> cellml-discussion@cellml.org
> http://www.cellml.org/mailman/listinfo/cellml-discussion
> 
