I'm posting this naive proposal for an XML-based, functional-programming style of bioinformatics language, or collection of languages, to the main open-bio lists I am familiar with. I'd like to find out whether anybody else is interested in thinking about a non-naive proposal, knows somebody who might be, or is already doing so.
For very tentative examples of the sort of thing I have in mind, see Examples 1 and 2 below. (For one overview of functional programming languages and XML, see for example http://www.xml.com/pub/a/2001/02/14/functional.html)

In what follows, an XML-based functional-programming style of bioinformatics language is referred to as an XGLT (i.e. XSLT-with-a-G, "Genetic Transformation Language", for want of a better term, though it's not really related to genetics specifically, so the G is moot).

The main ideas initially are that such a language would

* provide a high-level, implementation-independent interface to the rich Object Oriented (O-O) libraries (BioJava, BioPerl, BioPython and others), more accessible to non-experts and to developers working in other environments. XGLT interpreters could be developed using these libraries.

* provide an alternative "constructive" way of representing biological sequence and other data. An XGLT-based data packet would in general express how to (re)construct a given piece of biological sequence data - e.g. a sequence, a consensus alignment of sequences, or a translation - rather than convey the data itself or any particular model of the data. While initially limited to sequence data, it is possible such a functional programming dual may find application to other biological data.

Such languages would have the following benefits:

1) They would enable reference to and exchange of large complex data structures, such as alignments, in a succinct form that is very suitable for further manipulation. (Example 1 below)

2) Because such languages would in most cases exchange statements about how to (re)construct data, rather than the data itself, they would convey valuable information that is lost when only the end results are transmitted - for example, any indels made in a DNA sequence read as part of its protein translation. (Example 2 below)

3) Such languages could potentially provide a convenient, higher-level, more declarative style of functional programming interface to Object Oriented libraries such as BioJava, BioPerl, BioPython and others, as these O-O libraries could be used to write the XGLT engines required to actually interpret and execute XGLT statements.

4) A functional programming style lends itself more readily to expressing a chain of processing steps, i.e. a (mini-) pipeline, than does an Object Oriented system, which is more expressive of static structure. See Example 2 below for a very simple/naive example of a micro-translation-pipeline expressed as a nested series of transforms in an XGLT.

5) This point is related to both point 3) and point 4) above. It is likely that one popular method of making bioinformatics software libraries such as the Bio* projects accessible to the non-expert and/or non-Java/Perl/Python user will be to build Web Services directories (WSDL), with each service mapping to a static Bio* facade method that internally creates temporary Bio* objects to execute the service method. However, this approach is really limited to one-shot services. Where a task calls for a series of services to be invoked in a pipeline, the fact that the underlying Bio* objects do not persist between calls is a problem, because it would require expensive marshalling of output and input between web-service calls. Combining an XGLT language, which lets a non-expert user specify a nested series of processing steps in a high-level implementation-independent manner, with an XGLT interpreter/engine written using one or more of the well-engineered, rich O-O based Bio* libraries would potentially allow the entire pipeline to be executed within the O-O based engine, with objects persisting as and when required for the entire pipeline process.

6) An advantage of making a functional-programming representation XML-based is that in many cases the representation would not need to be interpreted by a real XGLT interpreter to be useful. For example, it is easy to use XSLT to transform Example 1 below into something like an SVG (http://www.w3.org/TR/SVG/) based display of the patterns of variation in an alignment, without actually executing the various editing steps required to construct the reads. An XGLT dual of a protein reference sequence, as in Example 2, includes enough information to plot a rich feature track on a genome viewer, without actually executing the translation.

Finally, it would be desirable to provide some sort of theoretical context for the suggestions and examples presented here, and so I give a very tentative one. Comparing the two representations of an alignment of sequences in Example 1, both contain the same information, but one (the XGLT version) is projected into a space of functions, and the other into a geometric space. This is analogous to the duality between the time-domain and frequency-domain representations of a mathematical function or data series. (Another analogy is with the duality between a vector space and the dual space of linear functionals defined on that space.) Others have pointed out a duality relationship between Object Oriented and Functional Programming languages.

So the tentative theoretical context is that expressions in XGLT languages would amount to almost formal duals of the original data and models. Therefore I would suggest that the XGLT representation of something like an alignment (Example 1) or protein translation (Example 2) be referred to as the "XGLT dual" or "functional programming dual" of the original, to emphasize that we are really dealing with the same information, but projected into a different space - one of functions.
And just as working in the frequency domain can sometimes be a productive thing to do with a mathematical function or data series, so working in a dual functional-programming domain as suggested here may be productive for some purposes.

I'd be grateful for any feedback (however harsh!) on my admittedly very naive proposal.

Cheers

Alan McCulloch

---------------------------------------------------------------------
Example 1
---------------------------------------------------------------------

Set out below is a possible XGLT dual of the following alignment fragment:

    >Contig1  CGATCGAGCGTG
    read1     CGATCCGAGCGTG
    read2      GATC-GAGCGTG
    read3       GACC-AGGGTT
    read4      GACC-GAGCGT
    read5       ATC-GA
              -------------
              CGATC-GAGCGTG

<!-- this is an XGLT functional-programming dual of an alignment of
     reads making up a contig. Rather than literally presenting the
     contig, consensus and alignments, it gives instructions for how to
     construct the consensus given the contig, and then for constructing
     each read from the consensus - i.e. working backwards.
-->
<mydata xmlns:xglt="www.pretend.xglt.org/XGLT-version-1.html"
        xmlns:xbiopath="www.pretend.xglt.org/xbiopath-version-1.html"
        xmlns:xseqedit="www.pretend.xglt.org/xseqedit-version-1.html"
        xmlns:xprotein="www.pretend.xglt.org/xprotein-version-1.html">

  <!-- provide the contig starting point -->
  <contig1>
    CGATCGAGCGTG
  </contig1>

  <!-- a transform to obtain the consensus -->
  <xglt:transform name="consensus">
    <xbiopath:copy_sequence source="../contig1"/>
    <xseqedit:insert from="5" value="gap()" count="1"/>
  </xglt:transform>

  <!-- transforms to obtain each read from the consensus - we only need
       to specify changes from the consensus. Each transform first calls
       the above consensus transform, to provide its starting point (an
       XGLT interpreter engine would of course optimise such multiple
       calls away in the actual execution)
  -->
  <xglt:transform name="read1">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="6" to="6" value="C"/>
  </xglt:transform>

  <xglt:transform name="read2">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="1" to="1" value="null()"/>
  </xglt:transform>

  <xglt:transform name="read3">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="1" to="2" value="null()"/>
    <xseqedit:substitute from="3" to="3" value="G"/>
    <xseqedit:substitute from="4" to="4" value="A"/>
    <xseqedit:substitute from="6" to="6" value="C"/>
    <xseqedit:substitute from="7" to="7" value="gap()"/>
    <xseqedit:substitute from="10" to="10" value="G"/>
    <xseqedit:substitute from="13" to="13" value="T"/>
  </xglt:transform>

  <xglt:transform name="read4">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="1" to="1" value="null()"/>
    <xseqedit:substitute from="4" to="4" value="C"/>
    <xseqedit:substitute from="13" to="13" value="null()"/>
  </xglt:transform>

  <xglt:transform name="read5">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="1" to="2" value="null()"/>
    <xseqedit:substitute from="9" to="13" value="null()"/>
  </xglt:transform>

</mydata>

---------------------------------------------------------------------
Example 2
---------------------------------------------------------------------

<!-- this is an XGLT functional-programming dual of a hypothetical
     RefSeq protein sequence that has undergone a curated translation
     from an underlying read (hg11 genome, say) that contains errors.
     Rather than presenting the literal end-product sequence, this dual
     gives instructions for how to construct it. When processed by an
     XGLT interpreter/engine, the end result would simply be the RefSeq
     protein sequence
-->
<xglt:transform name="myRefSeqProtein"
        xmlns:xglt="www.pretend.xglt.org/XGLT-version-1.html"
        xmlns:xbiopath="www.pretend.xglt.org/xbiopath-version-1.html"
        xmlns:xseqedit="www.pretend.xglt.org/xseqedit-version-1.html"
        xmlns:xprotein="www.pretend.xglt.org/xprotein-version-1.html">

  <!-- this transform retrieves 3 exons from hg11 and concatenates them
       into a single string -->
  <xglt:transform name="getMyRefseqExons">
    <xbiopath:extract_sequence target="hg11">
      <xbiopath:subseq start="chr3.12345" stop="chr3.12545"/>
      <xbiopath:subseq start="chr3.23456" stop="chr3.23656"/>
      <xbiopath:subseq start="chr3.34567" stop="chr3.34667"/>
      <xglt:concatenate xref="./workspace()"/>
    </xbiopath:extract_sequence>
  </xglt:transform>

  <!-- this transform calls the above transform to retrieve sequence,
       and then applies some edits -->
  <xglt:transform name="myCuratedRefSeq">
    <xglt:apply_transform name="../getMyRefseqExons"/>
    <xseqedit:delete from="100" to="110"/>
    <xseqedit:insert from="50" value="G" count="1"/>
    <xseqedit:substitute from="200" to="200" value="G"/>
  </xglt:transform>

  <!-- this transform calls the above transform to supply a DNA
       sequence, and then translates it -->
  <xglt:transform name="translation">
    <xglt:apply_transform name="../myCuratedRefSeq"/>
    <xprotein:translate species="human"/>
  </xglt:transform>

</xglt:transform>

---------------------------------------------------------------------
Comment on Above Examples
---------------------------------------------------------------------

In these examples I have...

1) ...tried to suggest a functional style of programming, but an actual XGLT may look quite different. Transformations are declared and referenced inside other transformations, in a nested structure. Each transform stands alone, in that it first calls another transform that provides its starting point (and this transform may in turn involve a call to another transform, etc.)

2) ...tried to demonstrate how an XGLT would convey valuable information about (in Example 2) the way the RefSeq was made, not just the sequence of the RefSeq itself. We not only achieve a succinct and in this case compressed expression of the actual sequence of the RefSeq, we also have an audit trail of how the RefSeq was curated.

3) ...supposed that rather than a single xglt language/namespace, there would be a collection of namespaces such as

   xglt: basic language for expressing things in a functional programming manner - defining and referencing transforms etc.

   xbiopath: functions for referencing and extracting biological sequences from databases and genomes. The example given in Example 2 is a simple coordinate-based extract, but one could also envisage specifying things like similarity-based paths:

      <xbiopath:match_sequence query="../myCuratedRefSeq()" method="blast -e 1.0e-30" target="hg15" offset="-2000" length="2500"/>

   - this would result in the extraction of 2.5 kb sections of sequence, from all positions 2 kb upstream of any hg15 hits to the RefSeq that was constructed in Example 2.

   xprotein: functions for working with protein primary and secondary structure.

   xseqedit: basic functions for sequence editing. This example shows indels and changes - one can also envisage, say, masking and quality trimming functions that could be specified in a transform, as part of a pipeline.

4) ...noted that one would also want to be able to use XPath-ish (http://www.w3.org/TR/xpath) references to other parts of the current or other XGLT documents.
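
To make the idea of an XGLT engine a little more concrete, here is a very rough Python sketch of how a toy interpreter might execute part of Example 1. It is only an illustration under several assumptions of mine that the proposal does not fix: gap() is rendered as "-", null() as a blank column, an insert with from="5" means "insert after 1-based position 5", and references like "../consensus" are resolved by their last path component only. The function names (symbol, evaluate) are hypothetical, and a real engine would presumably be built on one of the Bio* libraries rather than on plain strings.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, copied from Example 1.
NS = {
    "xglt": "www.pretend.xglt.org/XGLT-version-1.html",
    "xbiopath": "www.pretend.xglt.org/xbiopath-version-1.html",
    "xseqedit": "www.pretend.xglt.org/xseqedit-version-1.html",
}

# A cut-down copy of Example 1 (contig, consensus, read1, read3, read5).
DOC = """<mydata xmlns:xglt="www.pretend.xglt.org/XGLT-version-1.html"
        xmlns:xbiopath="www.pretend.xglt.org/xbiopath-version-1.html"
        xmlns:xseqedit="www.pretend.xglt.org/xseqedit-version-1.html">
  <contig1> CGATCGAGCGTG </contig1>
  <xglt:transform name="consensus">
    <xbiopath:copy_sequence source="../contig1"/>
    <xseqedit:insert from="5" value="gap()" count="1"/>
  </xglt:transform>
  <xglt:transform name="read1">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="6" to="6" value="C"/>
  </xglt:transform>
  <xglt:transform name="read3">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="1" to="2" value="null()"/>
    <xseqedit:substitute from="3" to="3" value="G"/>
    <xseqedit:substitute from="4" to="4" value="A"/>
    <xseqedit:substitute from="6" to="6" value="C"/>
    <xseqedit:substitute from="7" to="7" value="gap()"/>
    <xseqedit:substitute from="10" to="10" value="G"/>
    <xseqedit:substitute from="13" to="13" value="T"/>
  </xglt:transform>
  <xglt:transform name="read5">
    <xglt:apply_transform name="../consensus"/>
    <xseqedit:substitute from="1" to="2" value="null()"/>
    <xseqedit:substitute from="9" to="13" value="null()"/>
  </xglt:transform>
</mydata>"""


def symbol(value):
    """Map gap()/null() to display characters (an assumed convention)."""
    return {"gap()": "-", "null()": " "}.get(value, value)


def evaluate(root, name, cache):
    """Run the named transform; cache results so shared calls run once."""
    if name in cache:
        return cache[name]
    transform = root.find(f"xglt:transform[@name='{name}']", NS)
    if transform is None:
        raise KeyError(f"no transform named {name!r}")
    seq = []
    for step in transform:
        op = step.tag.split("}", 1)[1]  # strip the "{namespace-uri}" prefix
        if op == "copy_sequence":
            source = step.get("source").split("/")[-1]
            seq = list(root.find(source).text.strip())
        elif op == "apply_transform":
            seq = list(evaluate(root, step.get("name").split("/")[-1], cache))
        elif op == "insert":
            pos = int(step.get("from"))  # insert after 1-based position pos
            seq[pos:pos] = [symbol(step.get("value"))] * int(step.get("count"))
        elif op == "substitute":
            first, last = int(step.get("from")), int(step.get("to"))
            for i in range(first - 1, last):  # 1-based, inclusive range
                seq[i] = symbol(step.get("value"))
        else:
            raise ValueError(f"unsupported step: {op}")
    cache[name] = "".join(seq)
    return cache[name]


root = ET.fromstring(DOC)
cache = {}
for name in ("consensus", "read1", "read3", "read5"):
    print(f"{name:10s}{evaluate(root, name, cache)}")
```

Run as a script, this prints the consensus and the three reads as rows in the same column layout as the alignment fragment in Example 1. The cache dictionary is a crude stand-in for the "optimise such multiple calls away" remark in the Example 1 comment: the consensus transform is evaluated once and reused by every read.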
_______________________________________________
Biojava-l mailing list - http://biojava.org/mailman/listinfo/biojava-l