Hi... We've been seeing a number of problems recently relating to the construction of alphabets, and in particular to binding names to ambiguity and gap symbols. I'd like to propose a patch which addresses a number of issues in this area, and should leave us with a more solid Symbol/Alphabet infrastructure.
The idea is:

  - Remove the `token' (single-character name) property from the Symbol interface. This was problematic since it was undefined in many cases (especially cross-product alphabets, where there might well be more symbols than ASCII characters).

  - Replace the old SymbolParser (a one-way string -> symbol map) with a new interface, SymbolTokenization (a two-way string <--> symbol map).

In doing so, all sorts of cruft dies. In particular:

  - Alphabet creation can be simplified.

  - We can get back to the idealized situation whereby alphabets IMPLICITLY contain all possible ambiguity symbols (including the gap symbol), and no longer have to pre-seed ambiguity symbols where they have tokens associated with them (e.g. all DNA ambiguities, and a sub-set of protein ambiguities).

  - We are able to handle multiple naming schemes (for instance, arguments about what character to use for the gap symbol) in a clean, transparent way. The underlying Symbols can remain the same whichever naming scheme (==SymbolTokenization) you use.

  - All the conventions for naming of cross-product symbols are neatly packaged together in CrossProductTokenization.java. It's easy to add alternative conventions in the future if anyone needs to.

Right now, Symbols do still have a `name' property, but this is there for internal and debugging use only. For public display, you should always use a SymbolTokenization. Some people have suggested that even this property should go. I'm certainly open to comments on this issue.

The resulting patch has turned out to be non-trivial -- quite a lot of code has been touched. There will also need to be some (although hopefully not too many) changes to application code. However, I'll argue that the patch makes BioJava's Symbol code a lot stronger and more robust to future developments. I'd thus like to see it applied.
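To make the two-way mapping idea above a bit more concrete, here is a rough, self-contained sketch. It is NOT code from the patch: the class names (TokenizationSketch, Sym, MapTokenization) and the bind() helper are made up for illustration, and the real SymbolTokenization interface may well look different. It only shows the principle that Symbols carry no display token of their own, and that two naming schemes (here, `-' versus `.' for the gap) can share the same underlying Symbol objects.

```java
import java.util.HashMap;
import java.util.Map;

public class TokenizationSketch {
    /** Hypothetical symbol: identity plus a debugging name, no display token. */
    static final class Sym {
        private final String name;
        Sym(String name) { this.name = name; }
        String getName() { return name; }
    }

    /** Hypothetical two-way map between display tokens and symbols. */
    static final class MapTokenization {
        private final Map<String, Sym> tokenToSym = new HashMap<>();
        private final Map<Sym, String> symToToken = new HashMap<>();

        /** Register one token <--> symbol pair; returns this for chaining. */
        MapTokenization bind(Sym sym, String token) {
            tokenToSym.put(token, sym);
            symToToken.put(sym, token);
            return this;
        }

        /** string -> symbol direction. */
        Sym parseToken(String token) {
            Sym sym = tokenToSym.get(token);
            if (sym == null) {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
            return sym;
        }

        /** symbol -> string direction. */
        String tokenizeSymbol(Sym sym) {
            String token = symToToken.get(sym);
            if (token == null) {
                throw new IllegalArgumentException("No token for symbol: " + sym.getName());
            }
            return token;
        }
    }

    // The symbols are shared; only the naming schemes differ.
    static final Sym ADENINE = new Sym("adenine");
    static final Sym GAP = new Sym("gap");

    /** Naming scheme that writes the gap as "-". */
    static MapTokenization dashGaps() {
        return new MapTokenization().bind(ADENINE, "a").bind(GAP, "-");
    }

    /** Alternative naming scheme that writes the gap as ".". */
    static MapTokenization dotGaps() {
        return new MapTokenization().bind(ADENINE, "a").bind(GAP, ".");
    }

    public static void main(String[] args) {
        // Parse with one scheme, write with the other: the Symbol is unchanged.
        Sym parsed = dashGaps().parseToken("-");
        System.out.println(parsed == GAP);                    // prints "true"
        System.out.println(dotGaps().tokenizeSymbol(parsed)); // prints "."
    }
}
```

The point of the sketch is just that swapping the naming scheme never touches the Symbols themselves, which is what lets disagreements such as the gap character be settled per tokenization rather than per alphabet.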
There are a number of options:

  - Apply it straight away, and include all these changes in BioJava 1.2.

  - Release 1.2 in the reasonably near future, and apply this patch in the next development series.

  - Just ditch it and keep the status quo (although there definitely will have to be at least some tidying of Alphabet creation in the not-too-distant future).

  - Something else?

This is an issue which will affect a lot of people, so I'd like to hear as many views as possible. You can download the current patched source tree from:

  http://www.biojava.org/download/source/biojava-symtoke-20011016.tar.gz

I've ported the existing JUnit test suite across to the new API, and added a few extra tests for functionality which wasn't being exercised by existing tests. Everything is passing cleanly (but more test cases are always welcome...).

There are a few issues which should be resolved before checking this code into the main tree:

  - There's still some cruft left over from the old token system. This should be tracked down and removed.

  - There is some use of a temporary method AlphabetManager.parse(SymbolTokenization, String). These calls should probably be replaced by new SimpleSymbolList(SymbolTokenization, String);

  - AllSymbolsAlphabet has been removed. I know some people (Matthew?) find this very useful, so I should probably write a replacement.

Anyway, test, hack, and let me know what you think!

   Thomas

_______________________________________________
Biojava-l mailing list - [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l