Good Morning Srdjan Crnjanski:

Thank you for pointing out the errors in our documentation.
Can you clarify which index starts at 1 instead of 0?

For your future reference, you can view the source code to verify the
C structures used in the 2bit files:
http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/inc/twoBit.h

To download all the 2bit files, note this shell script command:
http://genomewiki.ucsc.edu/index.php/Download_All_Genomes

--Hiram

Srdjan Crnjanski wrote:
> I downloaded the file
>
> ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/hg19.2bit
>
> and wrote my own C++ code to interpret it.
>
> The structures and functions I wrote are based on the description of
> the .2bit file format on this page:
>
> http://genome.ucsc.edu/FAQ/FAQformat.html#format7
>
> After several adjustments (index starts from 1 instead of 0) I got
> the same interpretation of all sequences as your web site does.
> BUT!!!
> There is one statement on the info page about the .2bit format which
> is NOT VALID:
>
> "The packedDna field is padded with 0 bits as necessary to take an
> even multiple of 32 bits in the file ...."
>
> That is not true. Otherwise, my interpreter and the interpreter on
> your web site would both show wrong data at the beginning of some
> sequences.
>
> My interpreter gets these:
> chr1 packedDNA field start address: 0x0027b4e7
> chr2 packedDNA field start address: 0x0405328f
> chr3 packedDNA field start address: 0x07C592b3
> .......
> and so on
>
> It is obvious that those addresses are NOT 32-bit aligned. According
> to the documentation, those addresses should be populated with 0 bits
> until the address becomes 32-bit aligned. Ignoring that and reading
> from the original address, I get the same result as on your site.
> I also tried assuming that padding is implemented (as stated in the
> FAQ about the .2bit format) and ignored the bits from those addresses
> (if padding were implemented those bits should be zeros, but they
> aren't; I simply ignored them) until my counter reached the next
> 32-bit aligned address. Then I get sequences that differ from the
> sequences on your site (as expected).
>
> So padding is not implemented, and you should correct the description
> of the .2bit file format to read:
>
> "The packedDna field contains a sequence of two-bit values, and this
> sequence is NOT necessarily 32-bit aligned"
>
> Stating explicitly that the packedDna field is BYTE aligned makes no
> sense at all, since these days no computers access memory in
> quantities smaller than a byte. No compiler can place such a field at
> an address that is not byte aligned, because all previous fields in
> the structure are at least byte aligned and all have sizes that are a
> whole number of bytes.
>
> Also you can add:
>
> "All numeric fields (except nameSize, which is one byte in size) are
> 32-bit unsigned integers".
>
> You should also add a clarification on nBlockStarts and nBlockSizes:
> "... these are independent arrays, each of which has nBlockCount
> elements". Without this, the reader thinks of them as one array of
> nBlockCount pairs instead of two separate arrays.
> The same thing should be added for maskBlockStarts and
> maskBlockSizes: "... these are independent arrays, each of which has
> maskBlockCount elements".
>
> Finally, I have one more question.
> On your FTP server I found only the .2bit file for Homo sapiens, in
> the bigZips folder. Where can I find .2bit files for other species?
>
> Best regards,
> Srdjan Crnjanski

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
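The alignment claim in the quoted message can be checked with simple arithmetic. The sketch below, in Python, takes the three packedDna start addresses reported in the message and computes each one modulo 4 bytes (32 bits), along with the number of pad bytes a 32-bit alignment rule would have required. The names `packed_dna_starts` and `pad_bytes_needed` are made up for this illustration and do not come from the UCSC source.

```python
# packedDna start addresses as reported in the quoted message.
packed_dna_starts = {
    "chr1": 0x0027B4E7,
    "chr2": 0x0405328F,
    "chr3": 0x07C592B3,
}


def pad_bytes_needed(address, alignment=4):
    """Bytes to skip to reach the next multiple of `alignment` (0 if aligned)."""
    return (-address) % alignment


for name, address in packed_dna_starts.items():
    remainder = address % 4
    print(f"{name}: 0x{address:08x}  address % 4 = {remainder}  "
          f"pad bytes needed = {pad_bytes_needed(address)}")
```

Each of the three addresses leaves a remainder of 3, so none of them falls on a 32-bit boundary. A reader that skipped one byte to reach the next boundary would drop the first four bases of every one of these sequences.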

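As an illustration of the record layout discussed in the quoted message, here is a minimal Python sketch of reading one sequence record. It rests on several assumptions that should be verified against twoBit.h: the field order from the format description (dnaSize, nBlockCount, nBlockStarts, nBlockSizes, maskBlockCount, maskBlockStarts, maskBlockSizes, reserved, packedDna), little-endian 32-bit unsigned integers, and the 2-bit codes T=00, C=01, A=10, G=11 with the first base in the most significant bits of each byte. The function name `parse_record` and the test buffer are hypothetical. The sketch reads the start and size arrays as two separate runs and begins packedDna at the very next byte, with no 32-bit padding. It does not apply the N blocks or mask blocks to the decoded bases.

```python
import struct

BASES = "TCAG"  # 2-bit codes 0, 1, 2, 3 stand for T, C, A, G


def parse_record(buf, offset=0, byte_order="<"):
    """Parse one sequence record from `buf`, starting at byte `offset`."""
    pos = offset

    def read_u32_array(count):
        # Read `count` consecutive 32-bit unsigned integers and advance.
        nonlocal pos
        if count == 0:
            return []
        values = struct.unpack_from(f"{byte_order}{count}I", buf, pos)
        pos += 4 * count
        return list(values)

    dna_size = read_u32_array(1)[0]
    n_block_count = read_u32_array(1)[0]
    n_block_starts = read_u32_array(n_block_count)   # all starts first ...
    n_block_sizes = read_u32_array(n_block_count)    # ... then all sizes
    mask_block_count = read_u32_array(1)[0]
    mask_block_starts = read_u32_array(mask_block_count)
    mask_block_sizes = read_u32_array(mask_block_count)
    read_u32_array(1)  # reserved field, skipped

    # packedDna begins at the next byte; no skip to a 32-bit boundary.
    packed_start = pos
    packed = buf[packed_start:packed_start + (dna_size + 3) // 4]
    dna = "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3]
        for i in range(dna_size)
    )
    return {
        "dna_size": dna_size,
        "n_block_starts": n_block_starts,
        "n_block_sizes": n_block_sizes,
        "mask_block_starts": mask_block_starts,
        "mask_block_sizes": mask_block_sizes,
        "packed_start": packed_start,
        "dna": dna,
    }


# Made-up record: 8 bases, two N blocks (starts 0 and 6, sizes 1 and 2),
# no mask blocks. Three leading bytes make the record start unaligned.
header = struct.pack("<8I", 8, 2, 0, 6, 1, 2, 0, 0)
buf = b"\x00\x00\x00" + header + bytes([0x1B, 0x1B])
record = parse_record(buf, offset=3)

print(record["n_block_starts"], record["n_block_sizes"])  # [0, 6] [1, 2]
print(record["packed_start"] % 4)                          # 3
print(record["dna"])                                       # TCAGTCAG
```

In real files the byte order is indicated by the signature in the file header, so the `byte_order` argument would be chosen from that rather than fixed.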