Hi out there!

Until now I have had some off-list conversation with Matthew, Thomas and David, but I think this discussion is best continued here.

I am working on a parser for cytogenetic data notated in ISCN (the nomenclature for cytogenetic data). The long-term vision is to handle cytogenetic data the same way as other sequence data. Cytogenetic data is a wealth of information: consider just the already existing cytogenetic records linked to cell (mostly cancer) phenotypes.

One of the outputs of this parser should be annotated biojava sequence objects. One of the intermediate products should be something like a 'CytoML', for which I have a layout that is not published yet. As I picture it, this CytoML will, via some baseURI mechanism, reference instances of an AlphabetML. AlphabetML would be an XML dialect to describe alphabets and their symbol-substitution logic (ambiguity and abbreviation). This way CytoML could be independent of the described band resolution and also of the described organism.

For example, take the human cytogenetic loci and make a symbol of every locus. So we get symbols with the names '1', '2', and further on symbols with the names '1p', '1cen', '1q', '1p1' and so on. I decided to make every cytogenetic locus a symbol in this alphabet, and not the product of a number alphabet and a {'p','cen','q'} alphabet, to reflect the biological nature of the loci: they are not a combination of anything, but each is unique in its sequence.

The extra benefit of having an AlphabetML is that biojava Alphabet objects could be generated from the very same AlphabetML instances that are referenced from a specific CytoML instance (or even an instance of another format).

And here come biojava and my problems with Symbols and Alphabets. A specific example: the cytogenetic locus symbol '1' is an ambiguity over the two sequences {'1p','1cen','1q'} and {'1q','1cen','1p'}. In my understanding of biojava, a BasisSymbol essentially has two methods to specify this:
- getMatches(), to reflect ambiguity
- getSymbols(), to reflect 'abbreviation' of sequences

Both return a set (in this case an alphabet) or a list of references to other symbols. What I would have to return from getMatches() for the cytogenetic locus symbol '1' is an alphabet that has two symbols: one representing the first sequence, the other the second. But a symbol in biojava needs to have a token.

In summary, to get this working I could:

- specify extra tokens for the 'anonymous' sequence-representing symbols, or
- change the BasisSymbol interface so that a BasisSymbol can reflect ambiguity over other sequences?
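To make the locus '1' example concrete, here is a minimal sketch in plain Java of a symbol that is ambiguous over whole sequences rather than over single symbols. It deliberately does not use the biojava API: the names CytoAlphabetSketch, LocusSymbol, getAlternatives() and describe() are hypothetical and only illustrate the idea. The point is that the two orderings are plain lists, so they need no token of their own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model for illustration only; this is NOT the biojava API. */
final class CytoAlphabetSketch {

    /** A locus symbol that is either atomic or ambiguous over ordered sequences. */
    static final class LocusSymbol {
        private final String name;
        // Each alternative is one ordered sequence of symbols; empty for atomic symbols.
        private final List<List<LocusSymbol>> alternatives;

        private LocusSymbol(String name, List<List<LocusSymbol>> alternatives) {
            this.name = name;
            // Defensive, unmodifiable copies so a symbol cannot change after creation.
            List<List<LocusSymbol>> copy = new ArrayList<>();
            for (List<LocusSymbol> seq : alternatives) {
                copy.add(Collections.unmodifiableList(new ArrayList<>(seq)));
            }
            this.alternatives = Collections.unmodifiableList(copy);
        }

        static LocusSymbol atomic(String name) {
            return new LocusSymbol(name, Collections.<List<LocusSymbol>>emptyList());
        }

        static LocusSymbol ambiguous(String name, List<List<LocusSymbol>> alternatives) {
            return new LocusSymbol(name, alternatives);
        }

        String getName() { return name; }

        List<List<LocusSymbol>> getAlternatives() { return alternatives; }

        boolean isAtomic() { return alternatives.isEmpty(); }
    }

    /** Builds locus '1' as an ambiguity over the two orientations of its parts. */
    static LocusSymbol chromosome1() {
        LocusSymbol p = LocusSymbol.atomic("1p");
        LocusSymbol cen = LocusSymbol.atomic("1cen");
        LocusSymbol q = LocusSymbol.atomic("1q");
        return LocusSymbol.ambiguous("1", Arrays.asList(
                Arrays.asList(p, cen, q),
                Arrays.asList(q, cen, p)));
    }

    /** Renders a symbol as "name = {seq | seq}", or just the name if atomic. */
    static String describe(LocusSymbol s) {
        if (s.isAtomic()) {
            return s.getName();
        }
        List<String> alts = new ArrayList<>();
        for (List<LocusSymbol> seq : s.getAlternatives()) {
            List<String> names = new ArrayList<>();
            for (LocusSymbol part : seq) {
                names.add(part.getName());
            }
            alts.add(String.join(" ", names));
        }
        return s.getName() + " = {" + String.join(" | ", alts) + "}";
    }

    public static void main(String[] args) {
        System.out.println(describe(chromosome1())); // prints: 1 = {1p 1cen 1q | 1q 1cen 1p}
    }
}
```

In this sketch every named symbol ('1', '1p', '1cen', '1q') would come from the alphabet definition, while the two alternative sequences stay anonymous. Whether something equivalent can be expressed through the existing BasisSymbol interface is exactly the open question.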
The first choice is not acceptable, since I would be creating symbols with tokens that are not contained in the alphabet described in the AlphabetML instance. The second could pile up extra code changes spread all over the biojava API.

Maybe I got something wrong here, or maybe I do not really understand the theory. Does anybody have explanations or suggestions? I would much appreciate it.

Regards,
Armin