On 17/07/12 22:19, Hiram Clawson wrote:
> Good Afternoon Tony:
>
> We know why blat will crash on this sequence:
>    Homo_sapiens.GRCh37.67.dna.toplevel.fa.gz
>
> The internals of the program use 32-bit integers that cannot
> count past 4 GiB.  This FASTA file has 20 GiB of sequence.
> It will not function as a single .2bit file.

Hi, Hiram.

In fact, I created a 2bit file for each FASTA record in the GRCh37 file, 
which a colleague of mine had used to create a BLAST database. I ran 
"blat" against a 'file of files' (i.e. list with one filename per line 
as described in the BLAT manual), not against a single 2bit file.

My original reason for doing this was to compare the performance of
BLAST and BLAT using the same DB. No contest there, of course!

However, I wasn't aware of the 32-bit integer limitation in "blat",
especially since I compiled it on a 64-bit platform. I realise that the
size of an "int" is implementation-defined, and that the most 'efficient'
size would be the natural word size of the processor architecture. But,
of course, 64-bit compilers keep "int" at 32 bits for legacy reasons...

It's a different issue if you are talking about pointers because, unless
I compile and run 32-bit binaries, I will be using 64-bit pointers when
compiling and running "blat" under 64-bit Linux:

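#include <stdio.h>

int i;
int *p;

int main(void) {
        /* sizeof yields a size_t, so print it with %zu rather than %ld */
        printf("%zu\n", sizeof(i) * 8);   /* width of an int, in bits */
        printf("%zu\n", sizeof(p) * 8);   /* width of a pointer, in bits */
        return 0;
}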

In any case, I now realise that "blat" cannot handle more than 4 GiB of
sequence, because its internal counters are 32-bit integers.

>  I'm not sure
> you would want it to either.  Each of the haplotypes in this
> file reproduces the entire chromosome that the haplotype is contained
> within.  There are eight complete copies of chr1, five complete
> copies of chr2, etc.  It has two different copies of chrY, one
> completely empty of sequence.

You're right on the mark about that and I have to admit my mistake(!)

Thanks again for your helpful advice,

   Tony.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
