Hi everybody,

I was trying to retrieve FASTA protein sequences from GenBank by ID using seqret, but this did not work for every ID. Retrieval by GI, however, does work.
Additionally, during the indexing process (dbifasta) I got warnings like this one:

Warning: Duplicate ID skipped: 'AC000348_16' All hits will point to first ID found

Looking for an explanation for this behaviour, I found that the skipped IDs correspond to CDS from genomic sequences and have this format:

>gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...

>gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...

For the entries above, retrieval by the first identifier (the GI) works for both of them. Retrieval by the last identifier (AC000348_16) returns only the first one. Retrieval by the second identifier (AAG13419.1 and AAF79863.1) is not possible at all.

However, sequences with the following format are indexed correctly:

>gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]
MKMVSSSRLRCLLVLLLSLTASISCSFAGQRDSKLRLLLHRYPLQGSKQDMTRSALAELLLSDLLQGENE ...

and these sequences can be retrieved by both the first and the second identifier (64029 and CAA23986.1).

Does anybody know how to solve these problems?

Thanks in advance,
Natalia

_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss
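To see in advance which headers share the same trailing identifier, a short script along these lines might help. It is only a sketch: it assumes every header follows the `gi|number|db|accession|locus` layout shown above, and the names `parse_header` and `find_duplicate_loci` are made up for illustration. It does not reproduce how dbifasta itself parses headers; it just groups GIs by the last field.

```python
from collections import defaultdict


def parse_header(line):
    """Split an NCBI-style FASTA header into its identifier fields.

    Assumes the layout gi|<number>|<db>|<accession>|<locus>, where the
    last field may be empty (as in the Lophius americanus entry).
    """
    first_token = line.lstrip(">").split(None, 1)[0]
    fields = first_token.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("unexpected header layout: " + line)
    return {
        "gi": fields[1],
        "db": fields[2],
        "acc": fields[3],
        "locus": fields[4] if len(fields) > 4 else "",
    }


def find_duplicate_loci(headers):
    """Return {locus: [gi, ...]} for every non-empty locus seen more than once."""
    by_locus = defaultdict(list)
    for header in headers:
        ids = parse_header(header)
        if ids["locus"]:
            by_locus[ids["locus"]].append(ids["gi"])
    return {locus: gis for locus, gis in by_locus.items() if len(gis) > 1}


if __name__ == "__main__":
    headers = [
        ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
        ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
        ">gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]",
    ]
    for locus, gis in find_duplicate_loci(headers).items():
        print(locus, "is shared by GIs", ", ".join(gis))
```

Run over the header lines of the FASTA file, this lists each trailing identifier that occurs more than once together with the GIs that carry it, which are the entries likely to trigger the "Duplicate ID skipped" warning.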
