I downloaded the file ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/hg19.2bit
and wrote my own C++ code to interpret it. The structures and functions I wrote are based on the description of the .2bit file format on this page: http://genome.ucsc.edu/FAQ/FAQformat.html#format7

After several adjustments (indexing starts from 1 instead of 0), I got the same interpretation of all sequences as your web site does.

However, one statement on the info page about the .2bit format is NOT VALID: "The packedDna field is padded with 0 bits as necessary to take an even multiple of 32 bits in the file ...." That is not true. If it were, both my interpreter and the interpreter on your web site would be showing wrong data at the beginning of some sequences. My interpreter reports these addresses:

chr1 packedDNA field start address: 0x0027b4e7
chr2 packedDNA field start address: 0x0405328f
chr3 packedDNA field start address: 0x07C592b3
....... and so on

These addresses are clearly NOT 32-bit aligned. According to the documentation, the space from each of these addresses up to the next 32-bit aligned address should be filled with 0 bits. When I ignore that and read from the original address, I get the same result as on your site. I also tried assuming that padding is implemented (as stated in the FAQ about the .2bit format): I skipped the bits starting at those addresses (if padding were implemented those bits would be zeros, but they are not; I simply ignored them) until my counter reached the next 32-bit aligned address. As expected, the sequences I get that way differ from the sequences on your site. So padding is not implemented, and you should correct the description of the .2bit file format to read:
"The packedDna field contains sequence of two bits data and such sequence is NOT necessary 32 bit aligned" Making explicit state the packedDna field is BYTE aligned has no sense at all since these days there is no more computers accessing memory in quantities smaller than byte, so any compiler can not make alignment of such field to address which is not byte aligned because all previous fields in structure are at least one byte aligned and all have sizes in whole byte quantities. Also you can add: "All numeric field (except nameSize which is one byte in size) are 32 bits unsigned integers". Also you should add clarification on nBlockStarts and nBlockSizes "... these are independant arrays where each of them is of nBlockCount elements". Without this, reader thinks on these like array of pairs of nBlockCount elements instead on two separate arrays. Same thing should be added for maskBlockStarts and maskBlockSizes "... these are independant arrays where each of them is of maskBlockCount elements". Finally I have one more question, On your ftp I found only .2bit file for Homo Sapiens.in BigZips folder. where can I find .2bit files for other species? best reggards, Srdjan Crnjanski _______________________________________________ Genome maillist - [email protected] https://lists.soe.ucsc.edu/mailman/listinfo/genome

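
The alignment arithmetic behind the three addresses quoted in this message can be checked with a few lines of C++. This is only a sketch: isAligned32 and alignUp32 are hypothetical helper names, and the constants are the chr1, chr2 and chr3 start addresses as reported above.

```cpp
#include <cstdint>

// True when a byte offset is a multiple of 4, i.e. 32-bit aligned.
constexpr bool isAligned32(std::uint64_t offset) {
    return offset % 4 == 0;
}

// Smallest 32-bit aligned offset at or above `offset`; this is where the
// data would have to begin if the start of packedDna were padded.
constexpr std::uint64_t alignUp32(std::uint64_t offset) {
    return (offset + 3) & ~std::uint64_t{3};
}

// packedDna start addresses reported in the message.
constexpr std::uint64_t kChr1Start = 0x0027b4e7;
constexpr std::uint64_t kChr2Start = 0x0405328f;
constexpr std::uint64_t kChr3Start = 0x07C592b3;

// Compile-time checks: none of the three addresses is 32-bit aligned.
static_assert(!isAligned32(kChr1Start), "chr1 start is not 32-bit aligned");
static_assert(!isAligned32(kChr2Start), "chr2 start is not 32-bit aligned");
static_assert(!isAligned32(kChr3Start), "chr3 start is not 32-bit aligned");
```

Each of the three addresses leaves a remainder of 3 when divided by 4, so a reader that skipped ahead to alignUp32(address) would drop one byte, that is four bases, at the start of each sequence.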