Derek Gatherer wrote:

> Therefore, the answer to the original question, I reckon, is:
> shuffleseq is just as good if you choose to shuffle once as to
> shuffle 100 times. The same is true for make_randon_dna.

Actually, running make_random_seq twice in a row to generate a single sequence is counterproductive. If you do that, the first run will use a transition table which exactly matches that of the input sequence, while the second run will build its transition table from the first randomized sequence. Since that sequence is of finite length, the second run will only obtain an approximation of the originally observed transition frequencies for use in generating the final randomized sequence. The shorter the sequence, the greater this effect will be.

> There is
> nothing to separate the two programs in performance.

The two randomized sequences should have slightly different properties. The output of shuffleseq will maintain monomer composition (exactly), while the output of make_random_seq, with the parameters you used, will maintain dimer composition (approximately). How different the random sequences produced by the two programs are will depend to a great extent on how skewed the dimer composition of the input sequence is with respect to the expected dimer composition (as calculated from the monomer composition). That is, if A, G, C and T are all 25%, and all dimers are 6.25%, the outputs of the two programs will be very similar. However, consider an extreme case which illustrates how much they can differ:

    % echo AGCTAGCTAGCTAGCT \
      | make_random_seq -in - -inproc 2 -order 1 -n
    >random_sequence_0
    TAGCTAGCTAGCTAGC

which is just the original sequence, phase shifted. Similarly, for this very short sequence, even the -order 0 output is distinguishable from a shuffle:

    % echo AGCTAGCTAGCTAGCT | make_random_seq -in - -inproc 2 -order 0 -n
    >random_sequence_0
    GAAAGACTCTGTATGG

In this case the result is a sequence with 5 G, 5 A, 4 T and 2 C, whereas the output of shuffleseq would still have exactly 4 of each.
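The difference between the three behaviours can be sketched in a few lines of Python. This is only an illustration of the ideas, not the code of shuffleseq or make_random_seq: the function names are hypothetical, and it assumes that -order 0 amounts to drawing letters independently from the observed monomer frequencies and -order 1 to walking a transition table built from the observed dimers.

```python
import random
from collections import Counter, defaultdict

def shuffle_seq(seq, rng):
    """Permute the letters: monomer counts are preserved exactly."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def order0_seq(seq, length, rng):
    """Draw letters independently from the observed monomer frequencies
    (assumed analogue of -order 0): composition is only approximate."""
    return "".join(rng.choices(seq, k=length))

def transition_counts(seq):
    """Count how often each letter is followed by each other letter."""
    table = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        table[a][b] += 1
    return table

def order1_seq(seq, length, rng):
    """Walk a first-order transition table estimated from seq
    (assumed analogue of -order 1): dimer composition is approximate."""
    table = transition_counts(seq)
    current = rng.choice(seq)
    out = [current]
    while len(out) < length:
        followers = table.get(current)
        if not followers:
            # Dead end: this letter was only seen at the end of seq.
            current = rng.choice(seq)
        else:
            letters = list(followers)
            weights = [followers[x] for x in letters]
            current = rng.choices(letters, weights=weights)[0]
        out.append(current)
    return "".join(out)

if __name__ == "__main__":
    rng = random.Random()
    seq = "AGCTAGCTAGCTAGCT"
    print("shuffle:", shuffle_seq(seq, rng))
    print("order 0:", order0_seq(seq, len(seq), rng))
    print("order 1:", order1_seq(seq, len(seq), rng))
```

For the AGCT repeat every letter has exactly one possible successor, so the order 1 walk can only reproduce the repeat in some phase, while the shuffle always keeps 4 of each letter and the order 0 draw usually does not. Feeding the output of order1_seq back into order1_seq re-estimates the table from a finite sample, which is the loss of accuracy described above for running the program twice.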
Regards,

David Mathog
Manager, Sequence Analysis Facility, Biology Division, Caltech
