Re: [cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?

2007-04-25 Thread Matt
So the main issues you are describing so far are:

- the time taken to deserialise a model into a programming environment
- the time taken to perform operations over the model, and in
particular the concern about changes happening to the model while
these operations are being performed

My assumption is that you want to maintain the current behaviour of
the CellML API, which assumes that changes can happen at any time, and
which does not assume (although it does advertise) that a model has
been completely loaded when we are dealing with imports.

A couple of questions I have:

1) How important is speedy deserialisation to you? For example, if the
model were stored in a database, you might hardly ever need to
serialise to or deserialise from the CellML/XML form.

2) How important is it for the model to be updatable at any time? For
example, if the model were write-locked during these time-expensive
operations, or during a call to a generator, would that be too
restrictive?

My own inclination is that a database back end is the natural step to
make if you want to make this scalable. I presume you want the upper
end of your tests to be in the gigabyte range, with an obvious path
towards terabytes?

A database backend would make a nice addition to the implementation of
the API as an optional storage backend.
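
For concreteness, a very rough sketch of the idea in Python, using the
built-in sqlite3 module (the schema and names here are invented for
illustration, not anything from the current API):

    import sqlite3

    # Hypothetical element store: the tree lives on disk, and only the
    # rows a caller asks for are ever pulled into memory.
    class SqliteModelStore:
        def __init__(self, path):
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS element ("
                "id INTEGER PRIMARY KEY, parent INTEGER, "
                "tag TEXT, name TEXT)")

        def add_element(self, parent, tag, name):
            cur = self.db.execute(
                "INSERT INTO element (parent, tag, name) VALUES (?, ?, ?)",
                (parent, tag, name))
            return cur.lastrowid

        def children(self, parent):
            # Retrieval is bounded by the query, not the model size.
            return self.db.execute(
                "SELECT id, tag, name FROM element WHERE parent = ?",
                (parent,)).fetchall()

The point is only that access becomes demand-driven: memory use is
bounded by what you query, rather than by the whole model.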

Addressing write-locks would become a more general problem to resolve.
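
By way of illustration only (names invented, and a real solution would
also need to consider readers), a simple write-lock in Python might
look like this:

    import threading
    from contextlib import contextmanager

    # Hypothetical sketch: mutations are rejected while an expensive
    # operation (e.g. a call to a generator) holds the write lock.
    class LockableModel:
        def __init__(self):
            self._lock = threading.Lock()
            self._write_locked = False
            self.name = None

        @contextmanager
        def write_locked(self):
            with self._lock:
                self._write_locked = True
            try:
                yield self
            finally:
                with self._lock:
                    self._write_locked = False

        def set_name(self, name):
            if self._write_locked:
                raise RuntimeError("model is write-locked")
            self.name = name

A tool would run its expensive pass inside
"with model.write_locked(): ...", and any concurrent set_name() would
fail fast rather than mutating the model mid-operation.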

cheers
Matt


On 4/26/07, Andrew Miller <[EMAIL PROTECTED]> wrote:
> Randall Owen wrote:
> > In message <[EMAIL PROTECTED]> CellML Discussion List writes:
> >
> >> Hi,
> >>
> >> I am working on developing a CellML model (using external code) of
> >> transcriptional control in yeast which is 23 MB in size. I hope to
> >> eventually do a similar thing for organisms which have much more
> >> complicated sets of interactions, in which case this size may grow
> >> substantially.
> >>
> >> If anyone on this list is interested in similar problems (I presume
> >> similar issues come up in a range of systems biology problems, whether
> >> you are working with CellML or SBML), I would welcome your feedback and
> >> suggestions, and perhaps we could collaborate.
> >>
> >> This creates some unique issues for CellML processing tools:
> >> 1) Just parsing the CellML model (especially with a DOM-type parser
> >> which stores all the nodes into a tree, but probably with any type of
> >> parser) is very slow.
> >> 2) The CellML model might not all fit in memory at the same time,
> >> especially if the model gets to be multi-gigabyte. It might be possible
> >> to make use of swap to deal with this, but if the algorithms don't have
> >> explicit control over when things are swapped in and out, it will be
> >> hard to work with such a model.
> >> 3) The CellML model is much larger than it needs to be, which makes it
> >> inconvenient to exchange with third parties.
> >> 4) The current CellML API implementation has been designed for maximum
> >> flexibility ('one size fits all'), but this flexibility (e.g. supporting
> >> live iterators, access to arbitrary extension elements, and so on) is
> >> expensive for very large models. Much of this expensive functionality is
> >> probably unnecessary for most tools, although what is and is not
> >> necessary depends on the tool being used.
> >>
> >
> > This might be an area in which lazy evaluation and
> > parametrically-polymorphic programming might offer some advantages. In
> > particular, with lazy evaluation, data structures are constructed and
> > arguments evaluated only when needed.
> The code currently uses top-down dynamic programming, which is
> essentially the same thing here (i.e. it computes things like iterators
> exactly as they are needed, but it caches any data it can until it is
> invalidated by a change to a global serial number).
> >  In fact, demand-driven programming allows finite results to be obtained
> > from infinite data structures. This might alleviate having to store the 
> > entire model in memory at one time. In addition, in many cases intermediate 
> > data structures, though implied in the program, may not have to actually 
> > exist.
> >
> The way that these things currently work internally is as follows:
> 1) The entire XML file is parsed into our internal DOM representation,
> all in one go.
> This step happens synchronously and in one go at load time. To improve
> on this, we would have to keep referring back to the XML file (which
> could be a network resource) every time we required more data from the
> model. We could cache the file to avoid parsing it, but XML is not a
> very good format for random access, so I suspect this would cause worse
> performance problems, unless we had a binary format optimised for this.
>
> 2) A single top-level object for the model is created. This object is
> created in constant time when the model is loaded.
>
> Using this top-level object, it is then possible to iterate over the
> contents of the model. At this point, CellML element objects are set up,
> and a mapping is set up between these CellML element objects and their
> corresponding DOM element objects.

Re: [cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?

2007-04-25 Thread Andrew Miller
Randall Owen wrote:
> In message <[EMAIL PROTECTED]> CellML Discussion List writes:
>   
>> Hi,
>>
>> I am working on developing a CellML model (using external code) of 
>> transcriptional control in yeast which is 23 MB in size. I hope to 
>> eventually do a similar thing for organisms which have much more 
>> complicated sets of interactions, in which case this size may grow 
>> substantially.
>>
>> If anyone on this list is interested in similar problems (I presume 
>> similar issues come up in a range of systems biology problems, whether 
>> you are working with CellML or SBML), I would welcome your feedback and 
>> suggestions, and perhaps we could collaborate.
>>
>> This creates some unique issues for CellML processing tools:
>> 1) Just parsing the CellML model (especially with a DOM-type parser 
>> which stores all the nodes into a tree, but probably with any type of 
>> parser) is very slow.
>> 2) The CellML model might not all fit in memory at the same time, 
>> especially if the model gets to be multi-gigabyte. It might be possible 
>> to make use of swap to deal with this, but if the algorithms don't have 
>> explicit control over when things are swapped in and out, it will be 
>> hard to work with such a model.
>> 3) The CellML model is much larger than it needs to be, which makes it 
>> inconvenient to exchange with third parties.
>> 4) The current CellML API implementation has been designed for maximum 
>> flexibility ('one size fits all'), but this flexibility (e.g. supporting 
>> live iterators, access to arbitrary extension elements, and so on) is 
>> expensive for very large models. Much of this expensive functionality is 
>> probably unnecessary for most tools, although what is and is not 
>> necessary depends on the tool being used.
>> 
>
> This might be an area in which lazy evaluation and
> parametrically-polymorphic programming might offer some advantages. In
> particular, with lazy evaluation, data structures are constructed and
> arguments evaluated only when needed.
The code currently uses top-down dynamic programming, which is 
essentially the same thing here (i.e. it computes things like iterators 
exactly as they are needed, but it caches any data it can until it is 
invalidated by a change to a global serial number).
>  In fact, demand-driven programming allows finite results to be obtained
> from infinite data structures. This might alleviate having to store the 
> entire model in memory at one time. In addition, in many cases intermediate 
> data structures, though implied in the program, may not have to actually 
> exist.
>   
The way that these things currently work internally is as follows:
1) The entire XML file is parsed into our internal DOM representation, 
all in one go.
This step happens synchronously and in one go at load time. To improve 
on this, we would have to keep referring back to the XML file (which 
could be a network resource) every time we required more data from the 
model. We could cache the file to avoid parsing it, but XML is not a 
very good format for random access, so I suspect this would cause worse 
performance problems, unless we had a binary format optimised for this.
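
To illustrate what the alternative looks like (a sketch only; the
element names follow CellML, but none of this is taken from our
implementation), a streaming pass with Python's standard xml.sax
module never builds a DOM at all:

    import xml.sax

    # Collect component and variable names in one streaming pass, so
    # memory use is bounded by the output, not by the document.
    class NameCollector(xml.sax.ContentHandler):
        def __init__(self):
            self.names = []

        def startElement(self, tag, attrs):
            if tag in ("component", "variable"):
                self.names.append((tag, attrs.get("name")))

    handler = NameCollector()
    xml.sax.parse("model.cellml", handler)
    print(len(handler.names), "named elements seen")

Of course, this only works for tools that know in advance what they
want, which is exactly the trade-off discussed below.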

2) A single top-level object for the model is created. This object is 
created in constant time when the model is loaded.

Using this top-level object, it is then possible to iterate over the 
contents of the model. At this point, CellML element objects are set up, 
and a mapping is set up between these CellML element objects and their 
corresponding DOM element objects. This mapping is permanent in the 
sense that it persists even if the model is changed (however, the 
mapping is removed when the CellML element is deleted).
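
A minimal sketch of that mapping (invented names; our implementation
differs in detail): a table keyed by DOM node which hands back the
same wrapper every time, and forgets the entry on deletion:

    # Persistent CellML-element <-> DOM-element mapping.
    class CellMLElement:
        def __init__(self, dom_node):
            self.dom_node = dom_node

    class ElementMap:
        def __init__(self):
            self._by_dom = {}  # id(dom_node) -> CellMLElement

        def wrapper_for(self, dom_node):
            key = id(dom_node)
            if key not in self._by_dom:
                self._by_dom[key] = CellMLElement(dom_node)
            return self._by_dom[key]

        def forget(self, dom_node):
            # Called when the CellML element is deleted.
            self._by_dom.pop(id(dom_node), None)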

3) Additional objects are created and cached as required, but this cache 
is subject to expiry if the latest serial number (bumped up each time a 
change is made) doesn't match the serial number on the cache.
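
In outline (a sketch, not our actual code), the serial-number scheme
amounts to this:

    # Every mutation bumps a global serial; a cached value is reused
    # only while its recorded serial still matches the model's.
    class Model:
        def __init__(self):
            self.serial = 0
            self._cache = {}

        def mutate(self):
            self.serial += 1  # implicitly expires every cache entry

        def cached(self, key, compute):
            entry = self._cache.get(key)
            if entry is not None and entry[0] == self.serial:
                return entry[1]
            value = compute()
            self._cache[key] = (self.serial, value)
            return value

e.g. model.cached("variables", lambda: expensive_scan(model)) would
recompute only after a change (expensive_scan being any costly
operation over the model).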

I think any approach to do better has to be based upon bottom-up dynamic 
programming. This means that we need to know something about what 
information we will need before we use it (which is not so great for a 
general API, which is why we need a second, slimmed down version).
Things we can do:
  a) Not storing information which is not needed for a particular tool 
(we would need some way to specify what information is needed, or 
perhaps, more simply, just target the slimmed-down API at certain 
classes of tools and exclude e.g. editors, which need all the 
information so they can re-serialise the model without loss).
  b) Performing some calculations early, with some view of what 
information is required. This allows us, for example, to throw away any 
data we know we won't need (a small sketch follows).
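
Here is that sketch, with invented names: a single early pass keeps
only what a particular class of tool declares it needs, and the rest
is never stored:

    # Build just the indices a tool asked for; everything outside
    # 'needed_tags' is dropped as it streams past.
    def build_index(elements, needed_tags=("component", "variable")):
        index = {tag: [] for tag in needed_tags}
        for tag, name in elements:  # e.g. output of a streaming parse
            if tag in needed_tags:
                index[tag].append(name)
        return index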

The current API has been built assuming that the model could be changed 
at any point and that any changes will need to be re-serialised out, so 
it tries not to do any more work in terms of saving data than it needs to.

Examples of major issues:
=> Not much indexing i

Re: [cellml-discussion] Using CellML to represent huge CellML models: Has anyone worked on this already?

2007-04-25 Thread Randall Owen
In message <[EMAIL PROTECTED]> CellML Discussion List writes:
> Hi,
> 
> I am working on developing a CellML model (using external code) of 
> transcriptional control in yeast which is 23 MB in size. I hope to 
> eventually do a similar thing for organisms which have much more 
> complicated sets of interactions, in which case this size may grow 
> substantially.
> 
> If anyone on this list is interested in similar problems (I presume 
> similar issues come up in a range of systems biology problems, whether 
> you are working with CellML or SBML), I would welcome your feedback and 
> suggestions, and perhaps we could collaborate.
> 
> This creates some unique issues for CellML processing tools:
> 1) Just parsing the CellML model (especially with a DOM-type parser 
> which stores all the nodes into a tree, but probably with any type of 
> parser) is very slow.
> 2) The CellML model might not all fit in memory at the same time, 
> especially if the model gets to be multi-gigabyte. It might be possible 
> to make use of swap to deal with this, but if the algorithms don't have 
> explicit control over when things are swapped in and out, it will be 
> hard to work with such a model.
> 3) The CellML model is much larger than it needs to be, which makes it 
> inconvenient to exchange with third parties.
> 4) The current CellML API implementation has been designed for maximum 
> flexibility ('one size fits all'), but this flexibility (e.g. supporting 
> live iterators, access to arbitrary extension elements, and so on) is 
> expensive for very large models. Much of this expensive functionality is 
> probably unnecessary for most tools, although what is and is not 
> necessary depends on the tool being used.

This might be an area in which lazy evaluation and 
parametrically-polymorphic programming might offer some advantages. In 
particular, with lazy evaluation, data structures are constructed and 
arguments evaluated only when needed. In fact, demand-driven programming 
allows finite results to be obtained from infinite data structures. This 
might alleviate having to store the entire model in memory at one time. 
In addition, in many cases intermediate data structures, though implied 
in the program, may not have to actually exist.
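
To make the demand-driven point concrete (Python generators here,
though Haskell expresses it more directly), a conceptually infinite
stream of elements is harmless as long as consumers only ever force a
finite prefix:

    import itertools

    # Stand-in for a demand-driven model walk: each element would be
    # materialised from the file only when the consumer asks for it.
    def all_elements():
        for i in itertools.count():
            yield ("variable", "v%d" % i)

    first_ten = list(itertools.islice(all_elements(), 10))
    # nothing beyond the tenth element is ever constructed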

 
> In practice, nearly all existing CellML specific tools handle the file 
> badly. For example, PCEnv runs out of memory if you try to load the 
> file, while Jonathan Cooper's CellML validator just sits at 100% of a 
> single CPU for a long time (at least 15 minutes on my system, and still 
> running at the time of writing, but the time will obviously depend on 
> system speed).
> 
> There are some possible ways to improve on this:
> A) There are ways to generate ASN.1 schemata from the XML schemata. This 
> could be used to produce an efficient (in terms of both data size and 
> parse time) binary representation of CellML, with the possibility to 
> convert back to CellML. For example, Fast Infoset, and the similar (but 
> non ASN.1 based) BiM.
> B) A database-based representation of a large CellML model could be 
> used, either through an XML-enabled database, or more likely, some 
> mapping layer. This would allow the model to be loaded into the database 
> once, and the relevant parts retrieved from the on-disk database as 
> required, in an algorithmically sensible way. It is worth noting that my 
> model is generated using data from a relational database (a process 
> which takes up to a minute), but I would like the next step of my 
> pipeline to generalise to other CellML inputs.
> C) Another leaner API, read-only CellML API (perhaps based off the same 
> IDLs, but with certain functionality, like the ability to modify the 
> model, or set mutation event listeners, unavailable). We could add a 
> SAX-style event dispatcher instead, to allow users to save any 
> information they do want from extension elements, which will also not be 
> kept in the model. Comments, white-space, and so on would all be 
> stripped unlike in the current CellML API implementation. Tools which 
> are currently using the full CellML API but only require read-only 
> access (e.g. the CCGS) might be able to just 'flick the switch' and 
> benefit from the leaner API.
> 
> I would welcome any opinions, comments, suggestions, or collaborations 
> on this.

Perhaps this is where a Haskell API might be very useful.

> Best regards,
> Andrew
> 
___
cellml-discussion mailing list
cellml-discussion@cellml.org
http://www.cellml.org/mailman/listinfo/cellml-discussion