Dear Genome Browser Wizard,

I have been comparing how chains are described in *.net and *.chain files.
In particular, I compared the chains in the following files:

hg18-to-mm9 net file:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsMm9/hg18.mm9.net.gz
hg18-to-mm9 all.chain file:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsMm9/hg18.mm9.all.chain.gz
hg18-to-mm9 liftOver over.chain file:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/liftOver/hg18ToMm9.over.chain.gz

I noticed that the over.chain and all.chain files are consistent in their
chain descriptions (the chromosome, start position, and length in the query
genome, i.e. mm9), but both of those files are inconsistent with the
description in the net file.  I reached this conclusion after extracting
the chain descriptions from each file using the commands:

sed -e 's/^[[:space:]]*//' < hg18.mm9.net | grep fill \
  | cut -d ' ' -f4,6,7,9 | sort -k4n > net_chains
grep chain hg18.mm9.all.chain | cut -d ' ' -f8,11-13 | sort -k4n \
  | awk '{print $1, $2, $3-$2, $4}' > all.chain_chains
grep chain hg18ToMm9.over.chain | cut -d ' ' -f8,11-13 | sort -k4n \
  | awk '{print $1, $2, $3-$2, $4}' > over.chain_chains

These commands generate 4-column files where each row describes a chain and
is of the format "<chromosome> <start_pos> <length> <chain_id>".   The rows
are also sorted by chain id in ascending order.  Some chains, such as the
chain with id 1, have the same descriptions across all three files, but
many chains do not.  For example, the chain with id 6 has start
position 86799633 according to the net file, but it has start position 11
according to the all.chain and over.chain files.  Interestingly, all three
files agree that this chain is on chromosome 3 and has length 72800139.
More generally, it appears that the start positions are inconsistent for
many chains, while the chromosome and length are consistent for all
chains.
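
To make the comparison concrete, here is a rough sketch of how the
mismatching start positions can be listed from two of the 4-column files.
The function name compare_starts is only for illustration, and the sketch
assumes that each chain id appears at most once per file (if an id repeats
in the first file, its last row is the one that is compared).

```shell
# compare_starts FILE_A FILE_B   (hypothetical helper, for illustration only)
# Both inputs hold rows of "<chromosome> <start_pos> <length> <chain_id>".
# For every chain id present in both files, print
# "<chain_id> <start_in_A> <start_in_B>" when the start positions differ.
compare_starts() {
  awk '
    # First file: remember the start position for each chain id.
    NR == FNR { start[$4] = $2; next }
    # Second file: report ids whose start position does not match.
    ($4 in start) && start[$4] != $2 { print $4, start[$4], $2 }
  ' "$1" "$2"
}

# Example call:
#   compare_starts net_chains all.chain_chains
```

The same call with all.chain_chains and over.chain_chains as arguments is a
quick way to check whether those two files agree with each other.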

My understanding is that the net, all.chain, and over.chain files were
generated from the same original chains and the same assignment of chain
ids, so I am confused about why this inconsistency exists.  Please let me
know if I am misunderstanding something or if my analysis is flawed.
Thank you!

Best,
Michael
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
