On 17/07/12 22:19, Hiram Clawson wrote:
> Good Afternoon Tony:
>
> We know why blat will crash on this sequence:
> Homo_sapiens.GRCh37.67.dna.toplevel.fa.gz
>
> The internals of the program has 32-bit integers that can
> not count past 4 GiB. This fasta file has 20 GiB of sequence.
> It will not function as a single .2bit file.
Hi, Hiram.
In fact, I created a 2bit file for each FASTA record in the GRCh37 file,
which a colleague of mine had used to create a BLAST database. I ran
"blat" against a 'file of files' (i.e. list with one filename per line
as described in the BLAT manual), not against a single 2bit file.
The initial reason I did this was to compare the performance of BLAST
and BLAT using the same DB - No contest about that, of course!
However, I wasn't aware of the 32-bit integer limitation in "blat",
especially as I had compiled it on a 64-bit platform. I realise that the
size of an "int" is implementation-defined and that, in principle, the
most 'efficient' size is the natural word size of the processor
architecture. But, of course, the 64-bit compiler keeps "int" at 32 bits
for legacy reasons (the LP64 data model used on 64-bit Linux)...
It's a different issue if you are talking about pointers because, unless
I compile and run 32-bit binaries, I will be using 64-bit pointers when
compiling and running "blat" under 64-bit Linux:
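#include <stdio.h>

int i;
int *p;

/* Print the width, in bits, of an int and of a pointer to int. */
int main(void) {
    printf("%zu\n", sizeof(i) * 8);  /* 32 on 64-bit Linux (LP64) */
    printf("%zu\n", sizeof(p) * 8);  /* 64 on 64-bit Linux (LP64) */
    return 0;
}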
In any case, I've now realised that "blat" is limited to 4 GiB of
sequence by its 32-bit integers.
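To make the arithmetic behind that limit concrete, here is a minimal sketch. It assumes the sequence offsets are held in unsigned 32-bit integers, which is what a 4 GiB ceiling suggests but is not confirmed here against the blat source. The function names are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the blat source): can a total base
 * count be represented in an unsigned 32-bit integer? */
bool fits_in_u32(uint64_t n_bases) {
    return n_bases <= UINT32_MAX;
}

/* Value a 32-bit unsigned counter would hold after counting n_bases:
 * the conversion reduces the count modulo 2^32. */
uint32_t wrapped_count(uint64_t n_bases) {
    return (uint32_t)n_bases;
}

/* Self-check using the sizes mentioned in this thread. */
void check_thread_example(void) {
    uint64_t four_gib   = 4ULL << 30;   /* 2^32 */
    uint64_t twenty_gib = 20ULL << 30;  /* 5 * 2^32 */

    assert(fits_in_u32(four_gib - 1));      /* largest countable value */
    assert(!fits_in_u32(four_gib));         /* one past the limit */
    assert(!fits_in_u32(twenty_gib));       /* the 20 GiB FASTA file */
    assert(wrapped_count(twenty_gib) == 0); /* wraps exactly five times */
}
```

Compiling this as a 64-bit binary changes nothing, because uint32_t is 32 bits wide on every platform; only the pointer width grows.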
> I'm not sure
> you would want it to either. Each of the haplotypes in this
> file reproduce the entire chromosome that the haplotype is contained
> within. There are eight complete copies of chr1, five complete
> copies of chr2, etc. It has two different copies of chrY, one
> completely empty of sequence.
You're right on the mark about that and I have to admit my mistake(!)
Thanks again for your helpful advice,
Tony.
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome