On 10/04/2011 02:38 PM, Fernando Martinez wrote:
Hi, I am trying to retrieve sequences from a multi-fasta file where there are
identical sequences, and I want to extract only the ones in my list. How can
I do that?
Example:

Multi.fasta file:

seq1
atataga...
seq2
ttatggttca..
[...]
seq1
atataga...
[...]
And I only want to take seq1 and seq2, not seq1 twice!

If you really must start from that file: as usual with EMBOSS, there are several ways to do it.

1. Index with dbifasta
----------------------

You can index the file with the older dbifasta program. It does not allow duplicate IDs, so only one seq1 will be indexed.

% dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto

Then define a database in your .embossrc file:

DB multi [
  format: "fasta"
  method: "emblcd"
  type: "nucleotide"
  directory: "."
]

Then replace "Multi.fasta" in your listfile with "multi" and you will have the sequences you want.



2. Rewrite as single files in a new directory, then rewrite as one file
-----------------------------------------------------------------------

% mkdir multi
% seqret -ossingle -osdir multi Multi.fasta -auto
% ls multi
seq1.fasta  seq2.fasta ...

% cd multi
% seqret '*.fasta' ../Single.fasta

(Note: you do need the quotes around the wildcard file name.)

This will give you a file Single.fasta in the original directory with only the last version of each ID.



3. Write a new application
---------------------------

Another approach is to write your own new application. A copy of seqret that keeps a table of IDs and rejects any sequence whose ID is already known will rewrite the file (in any format) with only the first occurrence of each ID. We will add this to the next release.
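
Until then, the "table of IDs" idea can be sketched outside EMBOSS. The following is a minimal standalone Python sketch, not the seqret variant described above. It assumes plain FASTA input where each record starts with a ">" header line and the ID is the first word after the ">". The names dedupe_fasta and dedupe_file, and the optional "wanted" set, are hypothetical and only for illustration.

```python
def dedupe_fasta(lines, wanted=None):
    """Yield the lines of FASTA records, keeping only the first record per ID.

    lines  -- iterable of text lines (for example an open file)
    wanted -- optional set of IDs to keep; None keeps every ID (assumption:
              the ID is the first whitespace-delimited word after '>')
    """
    seen = set()   # table of IDs already encountered
    keep = False   # whether the current record is being written out
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            # Keep the record only if its ID is new and (optionally) wanted.
            keep = seq_id not in seen and (wanted is None or seq_id in wanted)
            seen.add(seq_id)
        if keep:
            yield line


def dedupe_file(src_path, dst_path, wanted=None):
    """Copy src_path to dst_path, dropping repeated IDs (first one wins)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(dedupe_fasta(src, wanted))
```

For the example in the question, calling dedupe_file("Multi.fasta", "Single.fasta", wanted={"seq1", "seq2"}) would write one seq1 and one seq2. Unlike method 2, which keeps the last version of each ID, this keeps the first.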


4. ... there may be more ways, but these will be enough to solve your problem.

Hope that helps,

Peter Rice
EMBOSS Team
_______________________________________________
EMBOSS mailing list
EMBOSS@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss
