Re: [EMBOSS] shuffleseq for multifasta?

Peter Rice Fri, 09 Nov 2018 01:09:02 -0800

Hi Anand,

As we found when we wrote EMBOSS, "FASTA format" is actually hard to define. The problem is the many ways you can define the ID and the other information on the first line (it is amazing how much information you can encode in a simple description).

Our solution was to define a set of formats that all read FASTA files but parse the first line in different ways. For example, "ncbi format" tries to read the NCBI database and ID syntax.

We added a format to read the sequence ID as-is for really awkward cases, and in honour of the author of FASTA we called it "pearson".

So, if you add -sformat pearson it should read the full IDs up to the first space. If you re-read the output, you should use -sf pearson again (-sf is just short for -sformat).
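
To illustrate what reading the ID "as-is" up to the first space means, here is a minimal Python sketch. It is not EMBOSS code: the helper name `split_header` is hypothetical, and it only mimics the behaviour described above under the assumption that the ID ends at the first whitespace character.

```python
def split_header(line):
    """Split a FASTA header line into (id, description).

    The ID is everything after '>' up to the first whitespace; the rest
    of the line is the description. Hypothetical helper, not EMBOSS code.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % line)
    # split(None, 1) splits once on the first run of whitespace
    parts = line[1:].strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


print(split_header(">Chr1 CHROMOSOME dumped from ADB: Feb/3/09 16:9"))
```

Nothing inside the ID is interpreted here, so characters such as ":" or "|" are kept unchanged.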


Hope that helps.

Peter Rice
[email protected]

On 09/11/2018 07:30, David Bauer wrote:

Hi Anand,
if you run “shuffleseq -help” you will see the type of input and output sequences.
Version: EMBOSS:6.5.7.0

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Sequence(s) filename and optional format, or
                                  reference (input USA)
   [-outseq]           seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)

The “all” in seqall and seqoutall indicates that input and output can be sequence files with multiple sequences.
This can be fasta format or any other sequence format supported by EMBOSS (genbank, embl etc.)
The names of the sequences, as they are in the original file, will be preserved in the output file.
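
As a rough illustration of that behaviour (each sequence shuffled on its own, with its header line kept), here is a small Python sketch. It is not how shuffleseq is implemented; `read_fasta` and `shuffle_records` are hypothetical helpers, and the example records are made up.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def shuffle_records(records, seed=None):
    """Shuffle the letters of each sequence; headers pass through unchanged."""
    rng = random.Random(seed)
    for header, seq in records:
        letters = list(seq)
        rng.shuffle(letters)  # same composition, new order
        yield header, "".join(letters)


# Made-up two-record multi-FASTA input
example = [">Chr1 first\n", "ACGTAC\n", "GT\n", ">Chr2 second\n", "TTGGCCAA\n"]
for header, seq in shuffle_records(read_fasta(example), seed=1):
    print(">" + header)
    print(seq)
```

Each output record keeps its own header, so a two-record input gives a two-record output with the names Chr1 and Chr2 intact.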
If I try to reproduce your example with the file downloaded from IPK:

shuffleseq Athaliana_167_TAIR9.fa test1.fa

the output file contains the sequences as named in the input file:

infoseq -only -name -desc test1.fa

Name  Description
Chr1  CHROMOSOME dumped from ADB: Feb/3/09 16:9; last updated:2007-12-20
Chr2  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated:2007-12-20
Chr3  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated:2007-12-20
Chr4  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated:2007-12-20
Chr5  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated:2007-12-20
ChrM  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated:2005-06-03
ChrC  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated:2005-06-03

The name of your input file contains “shIDscleaned-up”. You may have made some modifications to the sequence names which confuse EMBOSS.
You can test this by running infoseq as above and checking whether you get what you expect for “Name”.
Make sure you don’t have any “:” characters in the sequence names in your fasta file. This character has a special meaning in EMBOSS sequence names.
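
A quick way to run that check outside EMBOSS is sketched below in Python. The function name `check_fasta_ids` is hypothetical, and the sketch assumes the sequence name is the text between ">" and the first whitespace; it reports names containing ":" and names that occur more than once.

```python
from collections import Counter


def check_fasta_ids(lines):
    """Return (ids_with_colon, duplicated_ids) for FASTA header lines."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the name is the first whitespace-delimited field
            ids.append(fields[0] if fields else "")
    with_colon = [i for i in ids if ":" in i]
    counts = Counter(ids)
    duplicated = [i for i, n in counts.items() if n > 1]
    return with_colon, duplicated


# Made-up header lines for illustration
headers = [">Chr1 CHROMOSOME dumped from ADB: Feb/3/09", ">Chr1", ">db:Chr2"]
print(check_fasta_ids(headers))
```

An open file handle can be passed in place of the list, since the function only iterates over lines. A ":" in the description part is not flagged, only one in the name itself.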
Hope this helps.

Sincerely,

David.
*From:* EMBOSS <[email protected]> *On Behalf Of* Anandkumar Surendrarao
*Sent:* 09 November 2018 04:20
*To:* [email protected]
*Subject:* [EMBOSS] shuffleseq for multifasta?

Greetings!
I am new to EMBOSS, and trying to use shuffleseq to randomly shuffle entire genomes (one-by-one). My input genomic sequences are in multifasta format, and I wish to retain the same multifasta format for the output file as well, containing the shuffled DNA sequences.
From the information at http://emboss.sourceforge.net/apps/cvs/emboss/apps/shuffleseq.html, it appears to me that FASTA format is supported for neither input nor output. Am I mistaken?
OR

Is there a way to specify (multi)FASTA as both input and output formats?
In one run that I completed with a genome assembly with 5 chromosomes - Chr1 ... Chr5, the syntax I used was:
shuffleseq -sequence Athaliana_167_TAIR9.fa.shIDscleaned-up -outseq Athaliana_167_TAIR9_EmbossShuffled.fas
Strangely, in the output file, the fasta headers were all the same: Chr1.
Hence my confusion. Could someone please clarify what my input formatting should be and the correct syntax?
Thanks, in advance, for your help.

Sincerely,

Anand

_____

Anandkumar Surendrarao, PhD

+1.530.574.5134

+91.91760.70887


_______________________________________________
EMBOSS mailing list
[email protected]
http://mailman.open-bio.org/mailman/listinfo/emboss

