Dear Tony,

I just wish to mention that we have successfully mapped millions of paired-end Illumina sequences onto fungal-sized genomes by splitting the query into files of 1000-5000 sequences each, on a desktop running 32-bit Ubuntu 10.04. We simply use the Unix command:

    split -l 2 -a 5 -d fasta_file x/suffix

to generate the files in directory 'x'. We have been using this approach to visually assess assembly quality through integration with GBrowse.
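One thing to watch with the split approach: -l counts lines, not sequences, so -l 2 puts a single two-line record in each file, and -a 5 -d allows at most 100,000 numeric suffixes. For roughly 1000 single-line sequences per file the value would be -l 2000. If the FASTA has wrapped (multi-line) sequences, a fixed line count can cut a record in half. Below is a minimal sketch of a record-aware splitter in Python, in case it is useful. The function name split_fasta, the "chunk" prefix and the .fa extension are my own choices for illustration, not part of any existing tool.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, records_per_file=1000, prefix="chunk"):
    """Write consecutive groups of FASTA records to numbered files in out_dir.

    Returns the list of paths written. Names and defaults are illustrative.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    n_records = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A header starts a new record; open a new file every N records.
                if n_records % records_per_file == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"{prefix}{len(written):05d}.fa"
                    out = open(path, "w")
                    written.append(path)
                n_records += 1
            # Lines before the first header (if any) are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Calling split_fasta("fasta_file", "x", records_per_file=1000) would write x/chunk00000.fa, x/chunk00001.fa and so on, each holding whole records regardless of how the sequence lines are wrapped.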
HTH,
Mbandi S.K
----------------------------------------------------
PhD Student
Universiteit van Wes-Kaapland
Private Bag x17
Bellville, 7535
South Africa.
Phone: 27 (0)21 9592364 (work)
Fax: 27 (0)21 9592512 (work)
"Be keen to serendipity", Mbandi S.K 2007

> On 16/07/12 20:27, Galt Barber wrote:
>> Jim recently told another user with the same problem to split your huge
>> database up as Hiram advised above. This is due to using 32-bit pointers.
>> If you use 64-bit pointers, you can access more RAM but since your
>> pointers now require twice the storage, it is a waste unless you have a
>> machine with huge RAM.
>
> Hi, Galt.
>
> We are using a server with 256GiB RAM, running 64-bit Ubuntu 12.04 LTS.
>
>> In any case another benefit of splitting the database up on a large
>> machine or cluster is that you can run multiple instances of blat in
>> parallel, one process for each piece if you want.
>
> That's good advice and I've been thinking about doing it. What put me
> off a bit is that we have huge query files from NGS sequencing: I'm
> working on a pipeline that my colleagues wrote based on parsing BLAST
> output - I want to use BLAT instead, with the BLAST -m 8 TAB output.
>
> The query files are de-duped Illumina FASTQ files in FASTA format. We
> are looking for chimeras in deep sequencing data - currently by BLAST.
>
> I was concerned about the overhead of reading the query files into each
> BLAT against part of the database concurrently. However, I've not tried
> that yet, and it might not be the problem I anticipate, but our server
> is I/O-bound and it's something I will have to take into account.
>
>> Putting the results back together just amounts to cat-ing all the
>> results files together.
>
> OK, had worked that one out :-)
>
>> For psl output filtering, there are even tools to help such as pslReps
>> and pslCDnaFilter.
>
> Thanks all for your advice,
>
> Tony.
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
