Hi Xianjun, Reading the code is not a bad idea -- in fact, it's the only way I can answer your question -- but now that I have read it, the short answer is that we translate the coordinates of aligned blocks (links) only, not gaps, because we don't have enough information to reliably translate coordinates in the middle of gaps. The "part of chain that is in this window" is just the aligned links in the window.
If you would still like to see the code, start in kent/src/hg/hgc/hgc.c, function chainToOtherBrowser(). First it calls chainSubsetOnT() to chop the chain to the portion in the window, by target coordinates. Then it calls qChainRangePlusStrand(), which returns the overall query start and end of the chopped chain. chainSubsetOnT() is in kent/src/lib/chain.c. First it finds the first target link whose tEnd is greater than the start of the window. Then as chainFastSubsetOnT() it clips the start of the first aligned link to the start of the window if necessary, finds the last link in the window and clips that if necessary. It builds a new chain from the processed links. Score is scaled by the ratio of the new span to the original span -- maybe crude, but the best we can do without rescoring the new chain, which would be too computationally intensive. I guess you could do your own linear scaling of partial-gap sizes when there are gaps at the start or end of your window. That will give you coordinates of sequences that blastz was not able to align, but that are adjacent to aligned sequences. Whether those sequences should be paired or not, who knows -- I guess that's why we translate by aligned blocks only. Hope that helps, Angie On Thu, 11 Jun 2009, Xianjun Dong wrote: > Hi, > > I am curious how UCSC website locate the position (in species B) > corresponding to the part of chain that is in a window (in species A). > > To explain my question clearly, I have an example here. For instance, if > you open the zebrafish chain alignment track in the following human > locus page: > http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18&position=chr11:31787916-31796585 > You can see there are two chains under the region. Clicking one of them, > we will see another page: > http://genome.ucsc.edu/cgi-bin/hgc?hgsid=134848741&o=31083313&t=33335216&g=chainDanRer5&i=124&c=chr11&l=31787915&r=31796585&db=hg18&pix=900 > On this page, there is a link for "Open Zebrafish browser > <http://genome.ucsc.edu/cgi-bin/hgTracks?db=danRer5&ct=&position=chr25%3A15181364-15184384> > > at position corresponding to the part of chain that is in this window." > Clicking that will exactly bring us to the zebrafish > region(chr25:15181364-15184384), which is exactly the region I am > interested to get by script. It seems that you have the trick to cut the > chain accurately. > > OK, my question is: how do you make this done so exactly and so quickly? > Esp. when there are double lines at the end of the query region, which > represent more complex gaps involving substantial sequence in both species. > > This might be answered simply by "UCSC are using a super power > server....and well-structured databases". > > Well, I guess this might be one of the main reasons. But, beside that, I > am curious UCSC must be using some tricky way to get it quickly. I've > looked into the Chain (and chainLink) table. Only alignable blocks are > stored there, which means it's no problem to get the corresponding > position if the start/end position of the query reside inside the > blocks. But how about if it resides in the complex gap (double line in > the chain track)? How can you know exactly where its corresponding > position in zebrafish, for example? > > Can I know the script, or the way how you make it? > > Hope I explain my question clearly > > Thanks > > _______________________________________________ Genome maillist - [email protected] https://lists.soe.ucsc.edu/mailman/listinfo/genome
