[Genome] Documentation on .2bit file format structure doesn't match the real .2bit file data

Srdjan Crnjanski Fri, 24 Dec 2010 13:20:44 -0800

I downloaded the file

ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/hg19.2bit


and wrote my own C++ code to interpret it.

The structures and functions I wrote are based on the description of the
.2bit file format on this page:

http://genome.ucsc.edu/FAQ/FAQformat.html#format7

After several adjustments (indexes start from 1 instead of 0) I got
the same interpretation of all sequences as your web site does.
However, there is one statement on the info page about the .2bit format
which is NOT VALID:

"The packedDna field is padded with 0 bits as necessary to take an
even multiple of 32 bits in the file ...."

That is not true. If it were, both my interpreter and the interpreter on
your web site would show wrong data at the beginning of some sequences.

My interpreter gets these:
chr1  packedDNA field start address: 0x0027b4e7
chr2  packedDNA field start address: 0x0405328f
chr3  packedDNA field start address: 0x07C592b3
.......
and so on

It is obvious that those addresses are NOT 32-bit aligned. According to
the documentation, the bytes from each of those addresses up to the next
32-bit aligned address should be filled with 0 bits. If I ignore that and
read from the original address, I get the same result as on your site. I
also tried assuming that padding is implemented (as stated in the FAQ
about the .2bit format): I skipped the bits at those addresses (if padding
were implemented those bits should be zeros, but they aren't) until my
counter reached the next 32-bit aligned address. As expected, the
sequences I then get differ from the sequences on your site.
So padding is not implemented, and you should correct the description of
the .2bit file format to read:

"The packedDna field contains a sequence of two-bit values, and this
sequence is NOT necessarily 32-bit aligned"
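
A minimal sketch of the two checks described above, under stated assumptions. The function names `is_32bit_aligned` and `decode_packed_dna` are hypothetical and are not taken from my interpreter or from the UCSC sources. The base encoding used here (T = 00, C = 01, A = 10, G = 11, first base in the most significant two bits of each byte) is the one given on the FAQ page, and the three offsets are the ones listed above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// True when a byte offset in the file is a multiple of 4 bytes (32 bits).
constexpr bool is_32bit_aligned(std::uint64_t offset) {
    return offset % 4 == 0;
}

// The three offsets reported above: none is a multiple of 4.
static_assert(!is_32bit_aligned(0x0027b4e7), "chr1 offset is not 32-bit aligned");
static_assert(!is_32bit_aligned(0x0405328f), "chr2 offset is not 32-bit aligned");
static_assert(!is_32bit_aligned(0x07C592b3), "chr3 offset is not 32-bit aligned");

// Decode dna_size bases starting at the very first byte of packedDna,
// that is, without skipping any bytes to reach a 32-bit boundary.
std::string decode_packed_dna(const std::vector<std::uint8_t>& packed,
                              std::size_t dna_size) {
    static const char kBases[4] = {'T', 'C', 'A', 'G'};  // codes 00, 01, 10, 11
    std::string out;
    out.reserve(dna_size);
    for (std::size_t i = 0; i < dna_size; ++i) {
        const std::uint8_t byte = packed.at(i / 4);  // throws if the input is too short
        // First base of each byte sits in the two most significant bits.
        const unsigned shift = static_cast<unsigned>(6 - 2 * (i % 4));
        out.push_back(kBases[(byte >> shift) & 0x3u]);
    }
    return out;
}
```

For example, the single byte `0x1B` (binary 00011011) decodes to `TCAG` with this function. Reading from the unaligned start address is what matches the web site; skipping ahead to the next aligned address is what does not.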

Stating explicitly that the packedDna field is BYTE aligned makes no sense
at all, since no computers these days access memory in quantities
smaller than a byte. No compiler can align such a field to an address
which is not byte aligned, because all previous fields in the structure
are at least byte aligned and all have sizes in whole bytes.

You could also add:

"All numeric fields (except nameSize, which is one byte in size) are
32-bit unsigned integers".
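
To illustrate the field sizes, here is a rough sketch of reading one index entry. It assumes the entry layout given on the FAQ page (a one-byte nameSize, then the name characters, then a 32-bit offset) and assumes the file was written in little-endian byte order. `IndexEntry`, `read_u32_le` and `parse_index_entry` are hypothetical names used only for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct IndexEntry {
    std::string name;
    std::uint32_t offset = 0;
};

// Read one 32-bit unsigned integer stored least significant byte first.
// Assumption: the file is little-endian; a byte-swapped file needs the reverse order.
std::uint32_t read_u32_le(const std::vector<std::uint8_t>& buf, std::size_t pos) {
    if (pos > buf.size() || buf.size() - pos < 4) {
        throw std::out_of_range("truncated 32-bit field");
    }
    return static_cast<std::uint32_t>(buf[pos])
         | static_cast<std::uint32_t>(buf[pos + 1]) << 8
         | static_cast<std::uint32_t>(buf[pos + 2]) << 16
         | static_cast<std::uint32_t>(buf[pos + 3]) << 24;
}

// Parse one index entry at pos and advance pos past it.
IndexEntry parse_index_entry(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    if (pos >= buf.size()) {
        throw std::out_of_range("truncated index entry");
    }
    const std::size_t name_size = buf[pos];  // the only one-byte numeric field
    if (buf.size() - pos - 1 < name_size) {
        throw std::out_of_range("truncated sequence name");
    }
    IndexEntry entry;
    entry.name.assign(buf.begin() + pos + 1, buf.begin() + pos + 1 + name_size);
    entry.offset = read_u32_le(buf, pos + 1 + name_size);  // 32-bit unsigned
    pos += 1 + name_size + 4;
    return entry;
}
```

Assembling the integer byte by byte avoids depending on the byte order or alignment rules of the machine that runs the reader.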

You should also add a clarification on nBlockStarts and nBlockSizes: "...
these are independent arrays, each of nBlockCount elements". Without
this, the reader takes them for one array of nBlockCount pairs instead
of two separate arrays.
The same should be added for maskBlockStarts and maskBlockSizes: "...
these are independent arrays, each of maskBlockCount
elements".
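
The "two separate arrays" reading can be sketched as follows. This is only an illustration, assuming the field order from the FAQ page (dnaSize, nBlockCount, nBlockStarts, nBlockSizes, maskBlockCount, maskBlockStarts, maskBlockSizes, reserved, packedDna) and little-endian byte order. `SequenceRecord`, `next_u32`, `next_array` and `parse_sequence_record` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct SequenceRecord {
    std::uint32_t dna_size = 0;
    std::vector<std::uint32_t> n_block_starts;
    std::vector<std::uint32_t> n_block_sizes;
    std::vector<std::uint32_t> mask_block_starts;
    std::vector<std::uint32_t> mask_block_sizes;
    std::size_t packed_dna_offset = 0;  // byte offset of packedDna in the buffer
};

// Read a little-endian 32-bit unsigned integer at pos and advance pos.
std::uint32_t next_u32(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    if (pos > buf.size() || buf.size() - pos < 4) {
        throw std::out_of_range("truncated record");
    }
    const std::uint32_t v = static_cast<std::uint32_t>(buf[pos])
                          | static_cast<std::uint32_t>(buf[pos + 1]) << 8
                          | static_cast<std::uint32_t>(buf[pos + 2]) << 16
                          | static_cast<std::uint32_t>(buf[pos + 3]) << 24;
    pos += 4;
    return v;
}

// Read count consecutive 32-bit values as one array.
std::vector<std::uint32_t> next_array(const std::vector<std::uint8_t>& buf,
                                      std::size_t& pos, std::uint32_t count) {
    std::vector<std::uint32_t> out;
    for (std::uint32_t i = 0; i < count; ++i) {
        out.push_back(next_u32(buf, pos));
    }
    return out;
}

// Parse the fields of one sequence record that precede packedDna.
SequenceRecord parse_sequence_record(const std::vector<std::uint8_t>& buf,
                                     std::size_t pos) {
    SequenceRecord rec;
    rec.dna_size = next_u32(buf, pos);
    const std::uint32_t n_block_count = next_u32(buf, pos);
    rec.n_block_starts = next_array(buf, pos, n_block_count);  // all starts first
    rec.n_block_sizes = next_array(buf, pos, n_block_count);   // then all sizes
    const std::uint32_t mask_block_count = next_u32(buf, pos);
    rec.mask_block_starts = next_array(buf, pos, mask_block_count);
    rec.mask_block_sizes = next_array(buf, pos, mask_block_count);
    next_u32(buf, pos);           // reserved field, value ignored
    rec.packed_dna_offset = pos;  // used as-is, no rounding up to 32 bits
    return rec;
}
```

Reading the starts and sizes as interleaved pairs would consume the same number of bytes but assign the values to the wrong blocks, which is why the wording matters.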

Finally, I have one more question.
On your FTP site I found only the .2bit file for Homo sapiens in the
bigZips folder. Where can I find .2bit files for other species?

Best regards,
Srdjan Crnjanski
