On 16/07/12 22:28, Mbandi S.K wrote:
> Dear Tony,
>
> I just wish to mention here that we have effectively mapped millions of
> paired-end illumina sequences on to fungal size genomes by effectively
> splitting the query into files of 1000-5000 sequences on a desktop running
> Ubuntu 10.04 and 32bit. We simply use the unix command: split -l 2 -a 5 -d
> fasta_file x/suffix to generate the files in directory 'x'. We have be
> using this approach to visually assess assembly quality through
> integration with Gbrowse.

Hi, Mbandi.

I use Jim Kent's faSplit to do this sort of thing:

http://genomewiki.ucsc.edu/index.php/Kent_source_utilities

There must be a typo in your split command, because "-l 2" means create
output with two lines per file (by default, split creates output with
1000 lines per file). Even that would create hundreds of thousands of
files in my case, and each time I run BLAT on one of these files, it has
to read the database and build its indexes again. I'm reading about
using "gfServer" instead, to keep these indexes in memory...

We are looking for chimeric hybrid reads in Illumina deep sequencing
data, and we typically have 20 million raw reads after deduplication and
quality filtering. I want to BLAT against the full human genome because
we are looking for matches to intronic sequences. We have already
identified miRNA/mRNA chimeras using smaller databases.

I'm very impressed by the performance of BLAT but, as I posted here, we
get segfaults running it with a ~20GiB database created from a full
Ensembl version of hg19. Our BLAT does not segfault on smaller
databases, and I'll follow the advice of people on this list to split
the database into smaller parts.

It's quite likely that I did something wrong when creating my 2bit
format database from the top-level Ensembl hg19 FASTA file:

Homo_sapiens.GRCh37.67.dna.toplevel.fa

The 2bit version of hg19 that I downloaded from UCSC is a lot smaller,
and "blat" doesn't cause any segfaults with it:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

This runs in < 4GiB RAM, but mine uses > 20GiB and causes segfaults :-(

I put everything into my BLAT 2bit DB, including all the supercontigs
and patches etc., which was probably not the right thing to do...

Thanks for your helpful comments,

Tony.

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
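
A minimal sketch, in Python, of the record-based splitting discussed in
this thread, for cases where faSplit is not to hand. split -l counts
lines rather than sequences, so with two-line FASTA records "-l 2" gives
one read per file, and chunks of 1000 reads would need "-l 2000". The
function below counts ">" header lines instead, so it should also cope
with sequences wrapped over several lines. The function name, the
"chunk" prefix and the ".fa" suffix are illustrative choices only, not
anything taken from the Kent utilities.

```python
import os


def split_fasta(fasta_path, out_dir, records_per_file=1000, prefix="chunk"):
    """Write consecutive groups of FASTA records to numbered files.

    Returns the list of output paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    n_records = 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Open a new output file every `records_per_file` headers.
                if n_records % records_per_file == 0:
                    if out is not None:
                        out.close()
                    name = "%s%05d.fa" % (prefix, len(paths))
                    path = os.path.join(out_dir, name)
                    paths.append(path)
                    out = open(path, "w")
                n_records += 1
            # Lines before the first header (if any) are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

Each resulting file could then be passed to blat as the query in turn,
for example "blat hg19.2bit chunk00000.fa chunk00000.psl", where the
file names are placeholders.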
