Dear Jon,

> Dear Natalia
>
> By default, dbifasta will index the ID name and the accession number (if
> present).
>
> To index the Sequence Version, GI number and words in the description, you
> must run dbifasta with the '-fields' qualifier, e.g. "-fields acc",
> "-fields sv acc" etc. If you don't, you will not be able to retrieve by
> those fields. Please see
> http://emboss.sourceforge.net/apps/cvs/dbifasta.html.

Yes, the indexing was done with the -fields qualifier :-(

> dbifasta only retrieves the first of any duplicate entries. So far as I'm
> aware dbxfasta can retrieve duplicate entries.

We'll try with dbxfasta!

> Does that help? Feel free to get back in touch.

Yes, a lot. Thank you very much.

Regards,
Natalia

> Cheers
>
> Jon
>
>> Hi everybody,
>>
>> I was trying to retrieve fasta protein sequences from GenBank by id
>> using seqret but it was not possible for every id. However, retrieval by
>> GI is allowed.
>>
>> Additionally, during the indexing process (dbifasta) I've obtained some
>> errors like this one:
>>
>> Warning: Duplicate ID skipped: 'AC000348_16' All hits will point to
>> first ID found
>>
>> I was looking for an explanation to this behaviour and I've found that
>> skipped IDs correspond to CDS from genomic sequences and have this format:
>>
>> >gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]
>> MELPDVPVWRRVIVSAFFEALTFNIDIEEERSEIMMKTGAVVSNPRSRVKWDAFLSFQRDTSHNFTDRLY...
>> >gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]
>> MSVVLQITKDWVQALLGFLLLSFANISTRTNHKHFPHGSCSSIMAGFWIYMYIYSYLFITLKIIDLTS...
>>
>> In the previous entries, when I try to retrieve one of them by the first
>> identifier (gi), I can get both of them. When I try to do retrievals
>> using the last identifier (AC000348_16), I only get the first one. But
>> it's impossible to do retrievals by second identifier (AAG13419.1 and
>> AAF79863.1).
>>
>> However, sequences with the following format can be well indexed:
>>
>> >gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]
>> MKMVSSSRLRCLLVLLLSLTASISCSFAGQRDSKLRLLLHRYPLQGSKQDMTRSALAELLLSDLLQGENE ...
>>
>> and these sequences can be well retrieved by first and second
>> identifiers (64029 and CAA23986.1).
>>
>> Does anybody know how to solve these problems?
>> Thanks in advance,
>> Natalia
>> _______________________________________________
>> EMBOSS mailing list
>> [email protected]
>> http://lists.open-bio.org/mailman/listinfo/emboss
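P.S. The duplicate-ID situation described in the quoted message can be checked outside EMBOSS before indexing. The following is only a minimal sketch, not how dbifasta itself parses headers: it assumes NCBI-style headers of the form gi|<gi>|<db>|<accession.version>|<locus>, and the names parse_ncbi_header and find_duplicate_ids are hypothetical helpers made up for this illustration. Falling back to the accession when the locus field is empty is likewise an assumption of the sketch.

```python
from collections import defaultdict

# Header lines copied from the quoted message (sequence data omitted).
HEADERS = [
    ">gi|10121909|gb|AAG13419.1|AC000348_16 T7N9.24 [Arabidopsis thaliana]",
    ">gi|8778864|gb|AAF79863.1|AC000348_16 T7N9.28 [Arabidopsis thaliana]",
    ">gi|64029|emb|CAA23986.1| reading frame [Lophius americanus]",
]


def parse_ncbi_header(header):
    """Split one NCBI-style FASTA header into its identifier fields.

    Assumed layout: '>gi|<gi>|<db>|<acc.version>|<locus> description'.
    """
    # The identifier token is everything before the first whitespace.
    token = header.lstrip(">").split(None, 1)[0]
    parts = token.split("|")
    if len(parts) < 4 or parts[0] != "gi":
        raise ValueError("not a gi|...|db|acc.ver|locus header: " + header)
    sv = parts[3]                     # sequence version, e.g. AAG13419.1
    acc = sv.split(".", 1)[0]         # accession without the version suffix
    locus = parts[4] if len(parts) > 4 else ""
    # Assumption of this sketch: use the accession when the locus is empty.
    return {"gi": parts[1], "db": parts[2], "acc": acc, "sv": sv,
            "id": locus or acc}


def find_duplicate_ids(headers):
    """Return {id: [gi, ...]} for every ID shared by more than one entry."""
    by_id = defaultdict(list)
    for header in headers:
        fields = parse_ncbi_header(header)
        by_id[fields["id"]].append(fields["gi"])
    return {name: gis for name, gis in by_id.items() if len(gis) > 1}


if __name__ == "__main__":
    for name, gis in find_duplicate_ids(HEADERS).items():
        print("Duplicate ID", name, "shared by GI numbers", ", ".join(gis))
```

Run over the header lines of the whole FASTA file, this lists every locus name shared by several entries, which should correspond to the IDs that dbifasta warns about.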
