I'm not sure, but it should simply be a matter of defining an alphabet where each symbol in the alphabet is a 3-letter combo. Then you can use the alphabet to tokenize the input string appropriately.
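To make the idea concrete, here is a rough plain-Java sketch of that kind of tokenization. It deliberately does not use the BioJava API: the class name `ThreeLetterTokenizer` and its `tokenize` method are hypothetical, and it assumes the input is the 20 standard residues (plus `Xaa`) written as 3-letter codes, optionally separated by whitespace.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT part of BioJava: converts a string of 3-letter
// amino acid codes into the equivalent 1-letter string.
final class ThreeLetterTokenizer {
    // The "alphabet": each symbol is keyed by its upper-cased 3-letter code.
    private static final Map<String, Character> CODES = new HashMap<>();

    static {
        String[][] table = {
            {"Ala", "A"}, {"Arg", "R"}, {"Asn", "N"}, {"Asp", "D"}, {"Cys", "C"},
            {"Gln", "Q"}, {"Glu", "E"}, {"Gly", "G"}, {"His", "H"}, {"Ile", "I"},
            {"Leu", "L"}, {"Lys", "K"}, {"Met", "M"}, {"Phe", "F"}, {"Pro", "P"},
            {"Ser", "S"}, {"Thr", "T"}, {"Trp", "W"}, {"Tyr", "Y"}, {"Val", "V"},
            {"Xaa", "X"} // unknown residue
        };
        for (String[] row : table) {
            CODES.put(row[0].toUpperCase(Locale.ROOT), row[1].charAt(0));
        }
    }

    // Strips whitespace, then reads the input three characters at a time.
    static String tokenize(String input) {
        String compact = input.replaceAll("\\s+", "");
        if (compact.length() % 3 != 0) {
            throw new IllegalArgumentException(
                "Length is not a multiple of 3: " + compact.length());
        }
        StringBuilder out = new StringBuilder(compact.length() / 3);
        for (int i = 0; i < compact.length(); i += 3) {
            String token = compact.substring(i, i + 3).toUpperCase(Locale.ROOT);
            Character symbol = CODES.get(token);
            if (symbol == null) {
                throw new IllegalArgumentException(
                    "Unknown 3-letter code '" + token + "' at offset " + i);
            }
            out.append(symbol);
        }
        return out.toString();
    }
}
```

Calling `ThreeLetterTokenizer.tokenize("Met Ala Gly")` would give back the 1-letter string for those three residues. A real BioJava version would presumably return symbols from the protein alphabet rather than characters, but the fixed-width lookup would be the same.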
Mark will know more about this than me. Mark - comments?

cheers,
Richard

On Tue, 2006-08-01 at 17:41 +1000, Neil Bacon wrote:
> Hi,
> I'm looking at extending biojava sequence io to read sequences from
> patents (initially current US data formats, later perhaps older formats
> and other jurisdictions).
> Anyone done this already or interested?
>
> Protein data uses 3-letter codes. I found an old posting about 3-letter
> codes:
>
> [Biojava-dev] Protein alphabet names
> http://lists.open-bio.org/pipermail/biojava-dev/2002-October/000143.html
>
> > - Add an additional tokenization (probably called "three-letter"
> > unless someone comes up with a better suggestion) for people
> > who actually want 3-letter codes.
>
> Did this happen (I can't find it)?
> I'll try extending WordTokenization to do this unless someone has
> already done it or can advise me better (I'm new here and advice would
> be very welcome).
>
> Cheers,
> Neil Bacon
>
> _______________________________________________
> Biojava-l mailing list - [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416
