On 16/07/12 20:27, Galt Barber wrote:
> Jim recently told another user with the same problem to split your huge
> database up as Hiram advised above.  This is due to using 32bit pointers.
> If you use 64-bit pointers, you can access more ram but since your
> pointers now require twice the storage, it is a waste unless you have a
> machine with huge ram.

Hi, Galt.

We are using a server with 256GiB RAM, running 64-bit Ubuntu 12.04 LTS.

> In any case another benefit of splitting the database up
> on a large machine or cluster is that you can run multiple instances of blat
> in parallel, one process for each piece if you want.

That's good advice and I've been thinking about doing it. What put me 
off a bit is that we have huge query files from NGS sequencing: I'm 
working on a pipeline that my colleagues wrote based on parsing BLAST 
output - I want to use BLAT instead, with the BLAST -m 8 TAB output.

The query files are de-duplicated Illumina FASTQ files converted to FASTA
format. We are looking for chimeras in deep sequencing data, currently
with BLAST.

I was concerned about the overhead of each concurrent BLAT process
reading the full query files while it searches its own part of the
database. I've not tried that yet, and it might not be the problem I
anticipate, but our server is I/O-bound, so it's something I will have
to take into account.
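
For reference, a minimal sketch of the one-process-per-piece approach, assuming the database has already been split into pieces and that blat is on the PATH. The helper names (build_blat_command, run_pieces), the file names and the .blast8 suffix are hypothetical; -out=blast8 is blat's tabular BLAST-style output option. Every process reads the query file itself, so on an I/O-bound server max_workers may need to stay small.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_blat_command(db_piece, query, out_dir):
    """Return (argv, output_path) for one blat run against one database piece."""
    out_path = Path(out_dir) / (Path(db_piece).stem + ".blast8")
    argv = ["blat", str(db_piece), str(query), "-out=blast8", str(out_path)]
    return argv, out_path


def run_pieces(db_pieces, query, out_dir, max_workers=4):
    """Run one blat process per database piece, at most max_workers at a time."""
    jobs = [build_blat_command(piece, query, out_dir) for piece in db_pieces]

    def run(job):
        argv, out_path = job
        subprocess.run(argv, check=True)  # raises if blat exits non-zero
        return out_path

    # The threads only wait on external processes, so a thread pool is enough.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, jobs))
```

Something like run_pieces(["pieces/db_01.fa", "pieces/db_02.fa"], "reads.fa", "out") would then return the per-piece output paths once all runs have finished.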

> Putting the results back together just amounts to cat-ing all the
> results files together.

OK, had worked that one out :-)
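
A small sketch of that step in Python, for pipelines that would rather not shell out to cat. The function name concatenate_results is hypothetical. Plain concatenation suits the headerless blast8 output; PSL files written without -noHead each start with a header block, which would be repeated in the merged file.

```python
import shutil
from pathlib import Path


def concatenate_results(result_paths, dest):
    """Append each per-piece result file to dest, in the order given."""
    with open(dest, "wb") as out:
        for path in result_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # streams, so large files are fine
    return Path(dest)
```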

> For psl output filtering, there are even tools to help such as pslReps
> and pslCDnaFilter.

Thanks all for your advice,

   Tony.
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
