Thank you, David and Peter. My input file actually has shortened IDs (shIDs) and alternating lines of fasta header and sequences (cleaned-up).
First, I copied my input file to a name without EMBOSS special characters:

    cp Athaliana_167_TAIR9.fa.shIDscleaned-up Athaliana_167_TAIR9_UNshuffled.fa

Next, I ran shuffleseq using advice from both of you, as follows:

    time shuffleseq -sformat pearson Athaliana_167_TAIR9_UNshuffled.fa EMBOSS.fa
    Shuffle a set of sequences maintaining composition

    real    15m13.015s
    user    15m11.998s
    sys     0m0.844s

And this works, so thank you both very much.

Best,
Anand
_____
Anandkumar Surendrarao, PhD
+1.530.574.5134
+91.91760.70887

Note to self:
For ChrC, I compared sequences using BLAST 2 - no similarity detected, as expected.
For Chr1 and ChrC, I used a Perl script to calculate a, c, g, t, n and ? and found them to be exactly the same before and after shuffling.
Perl script: summarizeACGTcontent.pl

On Fri, Nov 9, 2018 at 4:03 AM Peter Rice <[email protected]> wrote:

> Hi Anand,
>
> As we found when we wrote EMBOSS, "FASTA format" is actually hard to
> define. The problem is the many ways you can define the ID, and the
> other information on the first line (it is amazing how much information
> you can encode in a simple description).
>
> Our solution was to define a set of formats that all read FASTA files,
> but parse the first line in different ways, for example "ncbi format"
> tries to read the NCBI database and id syntax.
>
> We added a format to read the sequence ID as-is for really awkward
> cases, and in honour of the author of FASTA we called it "pearson"
>
> So, if you add -sformat pearson it should read the full IDs up to the
> first space. If you re-read the output, you should use -sf pearson again
> (-sf is just short for -sformat)
>
> Hope that helps.
>
> Peter Rice
> [email protected]
>
> On 09/11/2018 07:30, David Bauer wrote:
> > Hi Anand,
> >
> > if you run “shuffleseq -help” you will see the type of input and output
> > sequences.
> >
> > Version: EMBOSS:6.5.7.0
> >
> >    Standard (Mandatory) qualifiers:
> >   [-sequence]  seqall     Sequence(s) filename and optional format, or
> >                           reference (input USA)
> >   [-outseq]    seqoutall  [<sequence>.<format>] Sequence set(s)
> >                           filename and optional format (output USA)
> >
> > The “all” in seqall and seqoutall indicates that input and output can be
> > sequence files with multiple sequences.
> >
> > This can be fasta format or any other sequence format supported by
> > EMBOSS (genbank, embl etc.)
> >
> > The names of the sequences as they are in the original file, will be
> > preserved in the output file.
> >
> > If I try to reproduce your example with the file downloaded from IPK:
> >
> >     shuffleseq Athaliana_167_TAIR9.fa test1.fa
> >
> > the output file contains the sequences as named in the input file:
> >
> >     infoseq -only -name -desc test1.fa
> >
> >     Name  Description
> >     Chr1  CHROMOSOME dumped from ADB: Feb/3/09 16:9; last updated: 2007-12-20
> >     Chr2  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2007-12-20
> >     Chr3  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2007-12-20
> >     Chr4  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2007-12-20
> >     Chr5  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2007-12-20
> >     ChrM  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2005-06-03
> >     ChrC  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2005-06-03
> >
> > Your input file contains in the name “shIDscleaned-up”. You may have
> > done some modifications to the sequence names which confuse EMBOSS.
> >
> > You can test this by running the infoseq as above and check if you get
> > for “Name” what you expect.
> >
> > Make sure you don’t have any “:” characters in the sequence names in
> > your fasta file. This character has a special meaning in EMBOSS sequence
> > names.
> >
> > Hope this helps.
> >
> > Sincerely,
> >
> > David.
> >
> > From: EMBOSS <[email protected]> On behalf of Anandkumar Surendrarao
> > Sent: 09 November 2018 04:20
> > To: [email protected]
> > Subject: [EMBOSS] shuffleseq for multifasta?
> >
> > Greetings!
> >
> > I am new to EMBOSS, and trying to use shuffleseq to randomly shuffle
> > entire genomes (one-by-one). My input genomic sequences are in
> > multifasta format. And I wish to retain the same multifasta format for
> > the output file as well, containing the shuffled DNA sequences.
> >
> > From the information at
> > http://emboss.sourceforge.net/apps/cvs/emboss/apps/shuffleseq.html, it
> > appears to me that FASTA format for neither input nor output is
> > supported. Am I mistaken?
> >
> > OR
> >
> > Is there a way to specify (multi)FASTA as both input and output formats?
> >
> > In one run that I completed with a genome assembly with 5 chromosomes -
> > Chr1 ... Chr5, the syntax I used was:
> >
> >     shuffleseq -sequence Athaliana_167_TAIR9.fa.shIDscleaned-up -outseq Athaliana_167_TAIR9_EmbossShuffled.fas
> >
> > Strangely, in the output file, the fasta headers were all repetitive Chr1.
> >
> > Hence my confusion. Could someone please clarify what my input
> > formatting should be and the correct syntax?
> >
> > Thanks, in advance, for your help.
> >
> > Sincerely,
> >
> > Anand
> >
> > _____
> >
> > Anandkumar Surendrarao, PhD
> >
> > +1.530.574.5134
> >
> > +91.91760.70887
> >
> > _______________________________________________
> > EMBOSS mailing list
> > [email protected]
> > http://mailman.open-bio.org/mailman/listinfo/emboss
_______________________________________________ EMBOSS mailing list [email protected] http://mailman.open-bio.org/mailman/listinfo/emboss
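
The composition check described in the note to self (counting a, c, g, t, n and ? for each sequence before and after shuffling) could be sketched in Python roughly as follows. This is not the summarizeACGTcontent.pl script named in the message, whose contents are not shown in the thread. The function names, the lower-casing of symbols, and the choice to compare sequences in file order by the ID up to the first whitespace are all assumptions made for illustration.

```python
import sys
from collections import Counter


def fasta_composition(path):
    """Return a list of (sequence_id, Counter) pairs in file order.

    A list (not a dict) is used so that repeated IDs, such as every
    header coming out as Chr1, stay visible instead of being merged.
    """
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the ID is the header text up to the first whitespace.
                parts = line[1:].split()
                records.append((parts[0] if parts else "", Counter()))
            elif records:
                # Count symbols case-insensitively.
                records[-1][1].update(line.lower())
    return records


def summarize(counter):
    """Collapse raw symbol counts into a, c, g, t, n and '?' (anything else)."""
    summary = {base: counter.get(base, 0) for base in "acgtn"}
    summary["?"] = sum(n for sym, n in counter.items() if sym not in "acgtn")
    return summary


def same_composition(path_a, path_b):
    """True if both files hold the same IDs, in order, with equal counts."""
    a = [(name, summarize(c)) for name, c in fasta_composition(path_a)]
    b = [(name, summarize(c)) for name, c in fasta_composition(path_b)]
    return a == b


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_composition.py unshuffled.fa shuffled.fa
    for name, counts in fasta_composition(sys.argv[1]):
        print(name, summarize(counts))
    print("identical composition:", same_composition(sys.argv[1], sys.argv[2]))
```

With the file names from the message, this would be called with Athaliana_167_TAIR9_UNshuffled.fa and EMBOSS.fa as the two arguments. Because IDs are compared as well as counts, a result of False can also mean that the headers changed between input and output, which is the symptom reported in the original question.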
