Re: [EMBOSS] uniq sequences on a list

Peter Rice Tue, 04 Oct 2011 07:14:47 -0700

On 10/04/2011 02:38 PM, Fernando Martinez wrote:

Hi, I am trying to retrieve sequences from a multi-fasta file where there are
identical sequences and I want to extract only the ones in my list. How can
I do that?
Example:


Multi.fasta file:

seq1

atataga...

seq2

ttatggttca..
[...]

seq1

atataga...
[...]
And I only want to take seq1 and seq2, not two times seq1!!

If you really must start from that file ... as usual with EMBOSS, there are several ways to do it.


1. Index with dbifasta
----------------------

You can index with the older dbifasta program. This does not allow duplicate IDs, so only one seq1 will be indexed.

% dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto


Then define a database in your .embossrc file:

DB multi [
  format: "fasta"
  method: "emblcd"
  type: "nucleotide"
  directory: "."
]

Then replace "Multi.fasta" in your listfile with "multi" and you will have the sequences you want.




2. Rewrite as single files in a new directory, then rewrite as one file
------------------------------------------------------------------------

% mkdir multi
% seqret -ossingle -osdir multi Multi.fasta -auto
% ls multi
seq1.fasta  seq2.fasta ...

% cd multi
% seqret '*.fasta' ../Single.fasta

(Note: you do need the quotes around the wildcard file name.)

This will give you a file Single.fasta in the original directory with only the last version of each ID, since each duplicate overwrites the earlier file written for that ID.




3. Write a new application
---------------------------

Another approach is to write your own new application. A copy of seqret which keeps a table of IDs and rejects any sequence with a known ID will rewrite the file (in any format) with only the first occurrence of each ID. We will add this to the next release.
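
Until then, the same "table of IDs" idea can be sketched outside EMBOSS. The following is only a minimal Python illustration, not the application that will be added to EMBOSS. It assumes plain FASTA input in which each record starts with a ">" line and the ID is the first whitespace-delimited word after the ">". The function name first_occurrences is made up for this example.

```python
def first_occurrences(lines):
    """Return FASTA lines, keeping only the first record seen for each ID.

    Assumption: the ID is the first whitespace-delimited word after '>'.
    """
    seen = set()   # table of IDs already encountered
    keep = False   # True while the current record is being kept
    kept_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            # Reject the record if its ID is already in the table.
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep and line:
            kept_lines.append(line)
    return kept_lines


# Small in-memory example modelled on the file in the question.
example = [">seq1", "atataga", ">seq2", "ttatggttca", ">seq1", "atataga"]
print("\n".join(first_occurrences(example)))
```

For the example above this prints the seq1 and seq2 records once each and drops the second seq1. Unlike method 2, it keeps the first occurrence of each ID rather than the last. To run it on a real file, pass an open file object (for instance open("Multi.fasta")) instead of the example list.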

4. ... there may be more ways, but these will be enough to solve your problem.


Hope that helps,

Peter Rice
EMBOSS Team
_______________________________________________
EMBOSS mailing list
EMBOSS@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss
