Yes, blat is very good at mRNA alignment.

----- Original Message -----
From: "Peng Yu" <[email protected]>
To: "Hiram Clawson" <[email protected]>
Cc: [email protected]
Sent: Tuesday, June 15, 2010 6:01:21 PM GMT -08:00 Tijuana / Baja California
Subject: Re: [Genome] parallel blat

I know there are other much faster tools for ungapped or small-gap
alignment. But I think that blat is still the best one for aligning
mRNAs to the genome, which may have very large gaps. Am I correct on
this?

On Tue, Jun 15, 2010 at 7:53 PM, Hiram Clawson <[email protected]> wrote:
> Absolutely. Break your target genome up into several hundred
> overlapping pieces. On the order of 5 to 10 million bases, or
> even smaller. Partition your 10 million short sequences into several hundred
> multiple record fasta files. Run a job for each target genome chunk
> against each query fasta file. These are all separate processes.
>
> Please note, blat is not necessarily the best tool for short sequence
> alignment. There are other much better tools for short sequence
> alignment. See also:
>
> http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
>
> --Hiram
>
> ----- Original Message -----
> From: "Peng Yu" <[email protected]>
> To: "Hiram Clawson" <[email protected]>
> Cc: [email protected]
> Sent: Tuesday, June 15, 2010 5:24:01 PM GMT -08:00 Tijuana / Baja California
> Subject: Re: [Genome] parallel blat
>
> I'm not sure what you described although I thought I understood.
>
> Suppose I have 10 million short sequences to be aligned to the human
> genome. It is making sense to split the 10 million sequences in 10
> files (each 1 million). Then I run 10 blat commands simultaneously.
> Each blat command will load all the chromosomes. Are you suggesting to
> break all_human_chromosomes.list into a number of smaller lists?
>
> blat -t=dna -q=dna -tileSize=11 -stepSize=5
> all_human_chromosomes.list short_seq0.fa short_seq0.psl
> ...
> blat -t=dna -q=dna -tileSize=11 -stepSize=5
> all_human_chromosomes.list short_seq9.fa short_seq9.psl
>
>
> On Tue, Jun 15, 2010 at 7:14 PM, Hiram Clawson <[email protected]> wrote:
>> No, this is not what I describe. Only the tiny portion of the
>> target genome is loaded and the tiny portion of the query genome
>> is loaded.
>> Nothing is duplicated between processes. We regularly
>> do this with genomes here and can get perhaps 100,000 processes
>> running on a 1,000 CPU core super computer and get the complete
>> genome to genome alignment done in a few hours. This is much
>> more simple and efficient than trying to write a complicated
>> parallel functional program that would be difficult to operate
>> in a variety of operating systems. The operating system
>> itself is optimized to manage the separate threads of the
>> individual processes that it manages. We don't have to
>> duplicate that complication.
>
> --
> Regards,
> Peng
>

--
Regards,
Peng
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
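
For readers of the archive, the partitioning scheme Hiram describes in this thread (overlapping target chunks, several query fasta files, one independent blat process per chunk/query pair) can be sketched roughly as below. This is only an illustration, not the script UCSC uses: the function names, the chunk and query file names, the 50,000,000-base sequence length, and the 10,000-base overlap are all made up for the example. The blat options are the ones from Peng's command line. Writing the chunk files and splitting the fasta records are assumed to happen elsewhere.

```python
from itertools import product


def overlapping_chunks(seq_len, chunk_size, overlap):
    """Yield half-open (start, end) windows that cover [0, seq_len).

    Consecutive windows share `overlap` bases, so an alignment that
    straddles a chunk boundary can still fall inside one window.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while start < seq_len:
        end = min(start + chunk_size, seq_len)
        yield start, end
        if end == seq_len:
            break
        start += step


def blat_jobs(target_files, query_files):
    """Build one blat command line per (target chunk, query file) pair.

    Each command is a separate process; nothing is shared between them.
    Output file names are hypothetical and derived from the pair indexes.
    """
    jobs = []
    pairs = product(enumerate(target_files), enumerate(query_files))
    for (t_idx, target), (q_idx, query) in pairs:
        out = f"t{t_idx:04d}_q{q_idx:04d}.psl"
        jobs.append(
            f"blat -t=dna -q=dna -tileSize=11 -stepSize=5 {target} {query} {out}"
        )
    return jobs


if __name__ == "__main__":
    # Hypothetical 50 Mb sequence, 10 Mb chunks, 10 kb overlap.
    windows = list(overlapping_chunks(50_000_000, 10_000_000, 10_000))
    # Hypothetical chunk file names, one per window.
    targets = [f"chunk_{s}_{e}.fa" for s, e in windows]
    # Hypothetical query files, each holding many fasta records.
    queries = [f"short_seq{i}.fa" for i in range(3)]
    for command in blat_jobs(targets, queries):
        print(command)
```

The printed lines are meant to be handed to whatever batch system or job runner is available. Note that if the chunks are written out as separate fasta files, the coordinates in each .psl are relative to the chunk, so they would need to be shifted by the chunk start before the results are merged.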
