Re: [Genome] Documentation on .2bit file format structure doesn't match the real .2bit file data

Hiram Clawson Mon, 27 Dec 2010 11:21:31 -0800

Good Morning Srdjan Crnjanski:

Thank you for pointing out the errors in our documentation.


Can you clarify which index starts at 1 instead of 0?

For your future reference, you can view the source code to verify
the C structures used in the 2bit files:
http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/inc/twoBit.h
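
For readers following along, here is a minimal C++ sketch of how one
sequence record could be read from a buffer, based only on the FAQ
description and the points raised in the quoted message below. It is not
copied from twoBit.h. The names TwoBitRecord, readU32, readArray and
parseRecord are hypothetical, and the sketch assumes the file is
little-endian (no byte swapping) and already loaded into memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory view of one sequence record; field names follow the FAQ.
struct TwoBitRecord {
    std::uint32_t dnaSize = 0;
    std::vector<std::uint32_t> nBlockStarts, nBlockSizes;        // two separate arrays
    std::vector<std::uint32_t> maskBlockStarts, maskBlockSizes;  // two separate arrays
    std::size_t packedDnaOffset = 0;  // byte offset where packedDna begins
};

// Read one 32-bit unsigned integer (assumed little-endian) and advance pos.
static std::uint32_t readU32(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    if (pos > buf.size() || buf.size() - pos < 4) throw std::out_of_range("truncated record");
    std::uint32_t v = static_cast<std::uint32_t>(buf[pos])
                    | (static_cast<std::uint32_t>(buf[pos + 1]) << 8)
                    | (static_cast<std::uint32_t>(buf[pos + 2]) << 16)
                    | (static_cast<std::uint32_t>(buf[pos + 3]) << 24);
    pos += 4;
    return v;
}

// Read count consecutive 32-bit integers.
static std::vector<std::uint32_t> readArray(const std::vector<std::uint8_t>& buf,
                                            std::size_t& pos, std::uint32_t count) {
    if (pos > buf.size() || count > (buf.size() - pos) / 4) throw std::out_of_range("truncated array");
    std::vector<std::uint32_t> out;
    for (std::uint32_t i = 0; i < count; ++i) out.push_back(readU32(buf, pos));
    return out;
}

// Parse the record starting at byte offset pos.
TwoBitRecord parseRecord(const std::vector<std::uint8_t>& buf, std::size_t pos) {
    TwoBitRecord r;
    r.dnaSize = readU32(buf, pos);
    std::uint32_t nBlockCount = readU32(buf, pos);
    r.nBlockStarts = readArray(buf, pos, nBlockCount);   // all starts first...
    r.nBlockSizes = readArray(buf, pos, nBlockCount);    // ...then all sizes
    std::uint32_t maskBlockCount = readU32(buf, pos);
    r.maskBlockStarts = readArray(buf, pos, maskBlockCount);
    r.maskBlockSizes = readArray(buf, pos, maskBlockCount);
    readU32(buf, pos);                                   // reserved field
    r.packedDnaOffset = pos;                             // no padding is skipped here
    return r;
}
```

Note that packedDnaOffset is simply wherever the preceding fields end,
so it is byte aligned but not necessarily 32-bit aligned.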

To download all the 2bit files, note this shell script command:

http://genomewiki.ucsc.edu/index.php/Download_All_Genomes

--Hiram

Srdjan Crnjanski wrote:
> I downloaded the file
> 
> ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/hg19.2bit
> 
> and wrote my own C++ code to interpret it.
> 
> The structures and functions I wrote are based on the description of
> the .2bit file format on this page:
> 
> http://genome.ucsc.edu/FAQ/FAQformat.html#format7
> 
> After several adjustments (an index starts from 1 instead of 0) I got
> the same interpretation of all sequences as your web site does.
> BUT!!!
> There is one statement on the info page about the .2bit format which
> is NOT VALID:
> 
> "The packedDna field is padded with 0 bits as necessary to take an
> even multiple of 32 bits in the file ...."
> 
> That is not true. If it were, both my interpreter and the interpreter
> on your web site would show wrong data at the beginning of some
> sequences.
> 
> My interpreter gets these:
> chr1  packedDNA field start address: 0x0027b4e7
> chr2  packedDNA field start address: 0x0405328f
> chr3  packedDNA field start address: 0x07C592b3
> .......
> and so on
> 
> It is obvious that those addresses are NOT 32-bit aligned. According
> to the documentation, the file should be filled with 0 bits from each
> of those addresses until the address becomes 32-bit aligned. When I
> ignore that and read from the original address, I get the same result
> as on your site. I also tried assuming that padding is implemented (as
> stated in the FAQ about the .2bit format) and skipped the bits from
> those addresses up to the next 32-bit aligned address (if padding were
> implemented those bits should be zeros, but they aren't). In that case
> the sequences I get differ from the sequences on your site, as
> expected.
> So padding is not implemented, and you should correct the description
> of the .2bit file format to read:
> 
> "The packedDna field contains a sequence of 2-bit values, and this
> sequence is NOT necessarily 32-bit aligned"
> 
> Stating explicitly that the packedDna field is BYTE aligned makes no
> sense, since no current computers access memory in units smaller than
> a byte. No compiler can place such a field at an address that is not
> byte aligned, because all previous fields in the structure are at
> least byte aligned and have sizes that are whole numbers of bytes.
> 
> You could also add:
> 
> "All numeric fields (except nameSize, which is one byte in size) are
> 32-bit unsigned integers".
> 
> You should also add a clarification on nBlockStarts and nBlockSizes:
> "... these are independent arrays, each with nBlockCount elements".
> Without this, the reader takes them to be a single array of
> nBlockCount pairs instead of two separate arrays.
> The same should be added for maskBlockStarts and maskBlockSizes:
> "... these are independent arrays, each with maskBlockCount
> elements".
> 
> Finally, I have one more question.
> On your FTP site I found only the .2bit file for Homo sapiens, in the
> bigZips folder. Where can I find .2bit files for other species?
> 
> Best regards,
> Srdjan Crnjanski
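
The alignment observation in the quoted message can be checked with a
short C++ sketch. The helper names isAligned32 and decodeBases are
hypothetical. The base encoding used (T=00, C=01, A=10, G=11, first base
in the most significant two bits of each byte) is the one given in the
FAQ, and the decoder reads directly from the given byte offset without
skipping any padding, which is the behaviour the quoted message reports
as matching the web site.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// True when a byte offset is a multiple of 4 bytes (32 bits).
bool isAligned32(std::uint64_t byteOffset) {
    return byteOffset % 4 == 0;
}

// Decode up to baseCount bases starting at byte offset start.
// No padding is skipped: decoding begins at exactly the given byte.
std::string decodeBases(const std::vector<std::uint8_t>& data,
                        std::size_t start, std::size_t baseCount) {
    static const char kBases[4] = {'T', 'C', 'A', 'G'};
    std::string out;
    for (std::size_t i = 0; i < baseCount; ++i) {
        std::size_t byteIndex = start + i / 4;      // four bases per byte
        if (byteIndex >= data.size()) break;        // stop at end of buffer
        unsigned shift = 6 - 2 * static_cast<unsigned>(i % 4);  // MSB pair first
        out.push_back(kBases[(data[byteIndex] >> shift) & 3u]);
    }
    return out;
}
```

With the three start addresses listed above (0x0027b4e7, 0x0405328f,
0x07C592b3), isAligned32 returns false for each, since all three leave a
remainder of 3 when divided by 4.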
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
