Hi there at BioJava,

I have two FASTA files: one containing amino acid sequences and the other containing DNA sequences.
In the AA FASTA file I have something like:

>FBpp0077713 type=protein; loc=2L:join(384551..384894,385701..385746,386308..386576,386703..387270); ID=FBpp0077713; name=al-PA; parent=FBgn0000061,FBtr0078053; dbxref=FlyBase:FBpp0077713,GB_protein:AAF51505.1,GB_protein:AAF51505,FlyBase_Annotation_IDs:CG3935-PA,REFSEQ:NP_722629; MD5=64a866db3e2913b97a2158c2de9d02f6; length=408; release=r5.9; species=Dmel;
MGISEEIKLEELPQEAKLAHPDAVVLVDRAPGSSAASAGAALTVSMSVSG
GAPSGASGASGGTNSPVSDGNSDCEADEYAPKRKQRRYRTTFTSFQLEEL...
etc etc etc

I would like to parse this header line, in particular the loc attribute, and use it to extract the matching region from the corresponding entry in the DNA FASTA file (so that I get the genomic data for the protein):

>FBgn0000061 type=gene; loc=2L:378116..387439; ID=FBgn0000061; name=al; dbxref=FlyBase:FBgn0000061,FlyBase:FBan0003935,FlyBase_Annotation_IDs:CG3935,GB:AE003589,GB_protein:AAF51505,GB:AY121696,GB_protein:AAM52023,GB:BI485174,GB:CZ486795,GB:L08401,GB_protein:AAA28840,UniProt/Swiss-Prot:Q06453,INTERPRO:IPR000047,INTERPRO:IPR001356,INTERPRO:IPR003654,INTERPRO:IPR009057,INTERPRO:IPR012287,bdgpinsituexpr:al,dedb:5830,drsc:FBgn0000061,flight:FBgn0000061,flyatlas:FBgn0000061,flyexpress:FBgn0000061,flygrid:59464,flymine:FBgn0000061,geo:FBgn0000061,hdri:FBgn0000061,if:/gene/aristal.htm,orthologs:ensANOGA:ENSANGP00000011877,orthologs:ensBOSTA:ENSBTAP00000015907,orthologs:ensCANFA:ENSCAFP00000009888,orthologs:ensGALGA:ENSGALP00000005255,orthologs:ensHOMSA:ENSP00000298420,orthologs:ensMACMU:ENSMMUP00000007349,orthologs:ensMONDO:ENSMODP00000008388,orthologs:ensPANTR:ENSPTRP00000004281,orthologs:ensRATNO:ENSRNOP00000027186,orthologs:ensTETNI:GSTENP00015517001,orthologs:graORYSA:Q6YYB8,orthologs:graORYSA:Q8W0T5,orthologs:modCAEEL:WBGene00044330,orthologs:modDANRE:ZDB-GENE-990415-15,orthologs:modMUSMU:MGI:1097716,panther:FBgn0000061; cyto_range=21C1-21C1; gbunit=AE014134; MD5=0f5568cf13aeb2c7076f11b1ce3d6b2f; length=9324; release=r5.9; species=Dmel;
GTAGTTTGCTGCCGGCTCTGGAACAGCCCGGTCATCTCGTCGCGTTCGGT
TCCGATTCCGATTCGAATAGTCGAGCTGGGGATACATTGTTGTTTCCGGG
etc etc etc

I understand this is not exactly conventional, but does BioJava support parsing the loc attribute (join, complement, etc.)?

Many thanks,
JP
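
To make the request concrete, here is a rough sketch in plain Java of the kind of parsing and extraction I have in mind. It does not use any BioJava API. The class name FlyBaseLoc and its methods are made up for illustration. It assumes the loc value has the form arm:start..end, arm:join(...), or either of those wrapped in complement(...), with 1-based inclusive coordinates, and that the gene record is stored in forward-strand orientation starting at the first coordinate of its own loc attribute. Single-base positions and fuzzy boundaries are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of BioJava): parses a FlyBase-style "loc=" header attribute. */
final class FlyBaseLoc {
    // "loc=" followed by the arm name, a colon, and the location spec (no spaces or semicolons).
    private static final Pattern LOC = Pattern.compile("loc=([^:;\\s]+):([^;\\s]+)");
    private static final Pattern RANGE = Pattern.compile("(\\d+)\\.\\.(\\d+)");

    final String arm;          // chromosome arm, e.g. "2L"
    final boolean complement;  // true if the spec is wrapped in complement(...)
    final List<int[]> ranges;  // 1-based inclusive {start, end} pairs, in header order

    private FlyBaseLoc(String arm, boolean complement, List<int[]> ranges) {
        this.arm = arm;
        this.complement = complement;
        this.ranges = ranges;
    }

    /** Parses the loc attribute out of a full FASTA header line. */
    static FlyBaseLoc parse(String header) {
        Matcher m = LOC.matcher(header);
        if (!m.find()) {
            throw new IllegalArgumentException("no loc attribute found in header");
        }
        String spec = m.group(2);
        List<int[]> ranges = new ArrayList<>();
        Matcher r = RANGE.matcher(spec);
        while (r.find()) {
            ranges.add(new int[] {Integer.parseInt(r.group(1)), Integer.parseInt(r.group(2))});
        }
        if (ranges.isEmpty()) {
            throw new IllegalArgumentException("no start..end ranges in loc: " + spec);
        }
        return new FlyBaseLoc(m.group(1), spec.startsWith("complement("), ranges);
    }

    /** Total number of bases covered by all ranges. */
    int totalLength() {
        int total = 0;
        for (int[] r : ranges) {
            total += r[1] - r[0] + 1;
        }
        return total;
    }

    /**
     * Concatenates the ranges taken from a forward-strand sequence whose first
     * base sits at genomic position seqStart (assumption: e.g. the start of the
     * gene's own loc), then reverse-complements the result if needed.
     */
    String extractFrom(String sequence, int seqStart) {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges) {
            int from = r[0] - seqStart;    // 0-based, inclusive
            int to = r[1] - seqStart + 1;  // 0-based, exclusive
            if (from < 0 || to > sequence.length()) {
                throw new IllegalArgumentException("range lies outside the sequence");
            }
            sb.append(sequence, from, to);
        }
        return complement ? reverseComplement(sb.toString()) : sb.toString();
    }

    private static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append('N'); break;  // anything else becomes N
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FlyBaseLoc loc = parse(">FBpp0077713 type=protein; "
                + "loc=2L:join(384551..384894,385701..385746,386308..386576,386703..387270); "
                + "ID=FBpp0077713;");
        System.out.println(loc.arm + ": " + loc.ranges.size() + " ranges, "
                + loc.totalLength() + " bases");
    }
}
```

With the gene record above, the call would be something like loc.extractFrom(geneSequence, 378116), where geneSequence is the DNA read from the FBgn0000061 entry. If BioJava already has a location parser that does this, I would much rather use that.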
