On 16/07/12 22:28, Mbandi S.K wrote:
> Dear Tony,
>
> I just wish to mention here that we have effectively mapped millions of
> paired-end Illumina sequences on to fungal-size genomes by
> splitting the query into files of 1000-5000 sequences on a desktop running
> 32-bit Ubuntu 10.04. We simply use the unix command: split -l 2 -a 5 -d
> fasta_file x/suffix  to generate the files in directory 'x'. We have been
> using this approach to visually assess assembly quality through
> integration with GBrowse.

Hi, Mbandi.

I use Jim Kent's faSplit to do this sort of thing:

   http://genomewiki.ucsc.edu/index.php/Kent_source_utilities

There must be a typo in your split command, because "-l 2" means create 
output with two lines per file (by default, split creates output with 
1000 lines per file). Even that would create hundreds of thousands of 
files in my case, and each time I run BLAT on one of these files, it has 
to read the database and build its indexes. I'm reading about using 
"gfServer" instead, which keeps these indexes in memory...

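Since "split -l" counts lines rather than sequences, it only gives a fixed 
number of sequences per file when every FASTA record is exactly two lines 
(5000 such records would need "-l 10000"). Below is a minimal sketch of a 
splitter that counts header lines instead, so wrapped sequences also work. 
The function names, the chunk size and the "chunk_00000.fa" naming are my 
own illustrative choices, not part of split, faSplit or BLAT:

```python
import os


def split_fasta(lines, records_per_chunk):
    """Yield lists of lines, each holding at most records_per_chunk records."""
    chunk, n_records = [], 0
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; flush first if the chunk is full.
            if n_records == records_per_chunk:
                yield chunk
                chunk, n_records = [], 0
            n_records += 1
        chunk.append(line)
    if chunk:
        yield chunk


def write_chunks(fasta_path, out_dir, records_per_chunk=5000):
    """Write each chunk to out_dir/chunk_NNNNN.fa and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(fasta_path) as handle:
        for i, chunk in enumerate(split_fasta(handle, records_per_chunk)):
            path = os.path.join(out_dir, f"chunk_{i:05d}.fa")
            with open(path, "w") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

Each resulting file could then be given to BLAT as a separate query. This 
does nothing about the cost of re-reading the database for every file.
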
We are looking for chimeric hybrid reads in Illumina deep sequencing 
data and we typically have 20 million raw reads, deduped and quality 
filtered. I want to BLAT against the full human genome because we are 
looking for matches to intronic sequences. We have already identified 
miRNA/mRNA chimeras using smaller databases. I'm very impressed by the 
performance of BLAT, but, as I posted here, we get segfaults running it 
with a ~20GiB database created from a full Ensembl version of hg19.

Our BLAT does not segfault on smaller databases and I'll follow the 
advice of people on this list to split the database into smaller parts. 
It's quite likely that I did something wrong creating my 2bit format 
database from the top-level Ensembl hg19 FASTA file:

   Homo_sapiens.GRCh37.67.dna.toplevel.fa

The 2bit version of hg19 that I downloaded from UCSC is a lot smaller 
and "blat" doesn't cause any segfaults:

   http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

This runs in < 4GiB RAM, but mine uses >20GiB and causes segfaults :-(

I put everything into my BLAT 2bit DB, including all the supercontigs 
and patches etc., which was probably not the right thing to do...

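One way to leave the supercontigs and patches out would be to filter the 
top-level FASTA before converting it to 2bit. The sketch below is only an 
illustration under two assumptions that I have not checked against the 
file itself: that the sequence name is the first whitespace-delimited 
token on each header line, and that the primary chromosomes are named 
1-22, X, Y and MT. The function name and the PRIMARY set are hypothetical:

```python
# Assumed names of the primary chromosomes in the top-level FASTA.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def filter_primary(lines, keep=PRIMARY):
    """Yield only the lines of records whose sequence name is in keep."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # Sequence name = first token after '>' (an assumption).
            fields = line[1:].split()
            keeping = bool(fields) and fields[0] in keep
        if keeping:
            yield line


def filter_file(in_path, out_path, keep=PRIMARY):
    """Stream in_path to out_path, dropping records not named in keep."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_primary(src, keep))
```

The filtered FASTA could then be converted with faToTwoBit as before. 
Whether this brings the memory use down to that of the UCSC hg19.2bit is 
something I would still have to test.
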
Thanks for your helpful comments,

   Tony.
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
