Natalia Jimenez Lozano wrote:
> I was looking for an explanation to this behaviour and I've found that
> skipped IDs correspond to CDS from genomic sequences and have this format:
>
> >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
> MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
> >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
> MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...

As Jon says, dbxfasta is one solution, but only a partial one. The real problem is that these FASTA format sequences do indeed have duplicate IDs: both entries above end in AC000348_16.

This is protein sequence data, so it is not GenBank. Was this GenPept or some other database? GenPept and other databases have been known to report "gb" or "emb" as the database for protein sequences.

A possible solution is to add a new ID format to dbifasta and dbxfasta that uses AAG13419 and AAF79863 as the ID and ignores the AC000348_16 part.

Hope this helps,

Peter
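
To make the duplicate-ID point concrete, here is a minimal Python sketch. It is not EMBOSS code and not how dbifasta parses headers internally; the function name parse_ncbi_header and the returned keys are hypothetical, and it assumes the five-field "gi|number|db|accession.version|locus" layout seen in the two quoted headers. It shows that the last field collides while the accession does not.

```python
def parse_ncbi_header(header):
    """Split an NCBI-style FASTA header into its pipe-delimited fields.

    Hypothetical helper: assumes the layout
    >gi|<number>|<db>|<accession>.<version>|<locus> <description>
    """
    # Keep only the first whitespace-delimited token, without the leading '>'.
    token = header.lstrip(">").split(None, 1)[0]
    fields = token.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("unexpected header layout: %r" % header)
    # Drop the ".1" style version suffix to get the bare accession.
    accession = fields[3].split(".", 1)[0]
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": accession,
        "locus": fields[4],
    }


headers = [
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
]

parsed = [parse_ncbi_header(h) for h in headers]
loci = [p["locus"] for p in parsed]
accessions = [p["accession"] for p in parsed]

# The locus field is shared by both entries; the accessions are distinct.
print("locus IDs unique:", len(set(loci)) == len(loci))
print("accessions unique:", len(set(accessions)) == len(accessions))
```

Indexing on the accession field in this way is the idea behind the proposed new ID format; the gi number would also be unique here, but the sketch follows the accession-based suggestion.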
