Hi, Tony! Reminder: there are many excellent short-read aligners available today.
-Galt

On Mon, Jul 16, 2012 at 5:56 PM, Tony Travis <[email protected]> wrote:

> On 16/07/12 22:28, Mbandi S.K wrote:
>
>> Dear Tony,
>>
>> I just wish to mention here that we have effectively mapped millions of
>> paired-end Illumina sequences onto fungal-size genomes by splitting the
>> query into files of 1000-5000 sequences on a desktop running 32-bit
>> Ubuntu 10.04. We simply use the Unix command: split -l 2 -a 5 -d
>> fasta_file x/suffix to generate the files in directory 'x'. We have been
>> using this approach to visually assess assembly quality through
>> integration with GBrowse.
>>
>
> Hi, Mbandi.
>
> I use Jim Kent's faSplit to do this sort of thing:
>
> http://genomewiki.ucsc.edu/index.php/Kent_source_utilities
>
> There must be a typo in your split command, because "-l 2" means create
> output with two lines per file (by default, split creates output with 1000
> lines per file). Even that would create hundreds of thousands of files in
> my case, and each time I run BLAT on one of these files, I have to read the
> database and create indexes, but I'm reading about using the "gfServer"
> instead to keep these indexes in memory...
>
> We are looking for chimeric hybrid reads in Illumina deep sequencing data
> and we typically have 20 million raw reads, deduped and quality filtered. I
> want to BLAT against the full human genome because we are looking for
> matches to intronic sequences. We have already identified miRNA/mRNA
> chimeras using smaller databases. I'm very impressed by the performance of
> BLAT, but as I posted here we get segfaults running it with a ~20GiB
> database created from a full Ensembl version of hg19.
>
> Our BLAT does not segfault on smaller databases and I'll follow the advice
> of people on this list to split the database into smaller parts. It's quite
> likely that I did something wrong creating my 2bit format database from the
> top-level Ensembl hg19 FASTA file:
>
> Homo_sapiens.GRCh37.67.dna.toplevel.fa
>
> The 2bit version of hg19 that I downloaded from UCSC is a lot smaller and
> "blat" doesn't cause any segfaults:
>
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
>
> This runs in < 4GiB RAM, but mine uses >20GiB and causes segfaults :-(
>
> I put everything into my BLAT 2bit DB, including all the supercontigs and
> patches etc., which was probably not the right thing to do...
>
> Thanks for your helpful comments,
>
> Tony.
>
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
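
A side note on the query-splitting step discussed in this thread: line-based splitting with "split -l" only keeps records intact when every FASTA record has the same number of lines, and the file count depends directly on the value given to "-l". The following is a minimal Python sketch that chunks a FASTA file by record count instead, whatever the line wrapping. The names split_fasta and read_fasta_records, the default chunk size, and the "chunkNNNNN.fa" output naming are illustrative assumptions; this is not how faSplit or BLAT work internally.

```python
import os
from typing import Iterator, List


def read_fasta_records(path: str) -> Iterator[List[str]]:
    """Yield each FASTA record as a list of lines, header line first."""
    record: List[str] = []
    with open(path) as handle:
        for line in handle:
            # A new header closes the record collected so far.
            if line.startswith(">") and record:
                yield record
                record = []
            if line.strip():  # skip blank lines
                record.append(line)
    if record:
        yield record


def split_fasta(path: str, out_dir: str, per_file: int = 1000) -> List[str]:
    """Write chunks of at most `per_file` records; return the chunk paths.

    Hypothetical helper: the file naming scheme is an assumption.
    """
    os.makedirs(out_dir, exist_ok=True)
    written: List[str] = []
    out = None
    try:
        for index, record in enumerate(read_fasta_records(path)):
            if index % per_file == 0:
                # Start a new chunk file every `per_file` records.
                if out is not None:
                    out.close()
                name = os.path.join(out_dir, f"chunk{index // per_file:05d}.fa")
                out = open(name, "w")
                written.append(name)
            out.writelines(record)
    finally:
        if out is not None:
            out.close()
    return written
```

For example, split_fasta("reads.fa", "x", per_file=5000) would write x/chunk00000.fa, x/chunk00001.fa and so on, each holding up to 5000 sequences; 20 million reads would then give about 4000 files rather than hundreds of thousands.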
