Hi Xianjun,
Sorry about the delayed response. Comments embedded:
> 1. I installed kent's source tree, and the blastz program. I also made
> the .2bit file for zebrafish (Zv8), and the chromosome size file. What
> else I need to get the selfChain?
doBlastzChainNet.pl runs compute cluster batches using the parasol
system. The script creates files that list query and target sequences
and a template file for generating command specs for cluster jobs.
Then it calls the gensub2 program to generate the command specs, and
invokes the parasol commands in HgAutomate.pm's $paraRun to run the
cluster batch. You probably have a different infrastructure for your
local compute cluster, so all of that code will need to be adapted.
There are 3 cluster batches: one very large cluster batch to run
thousands of blat jobs on pairs of sequence chunks (for zebrafish,
probably just one thousand jobs). Then there is a smaller batch
(~300) that concatenates results for each target sequence across all
of the query sequence chunks, and another smaller cluster batch (~100)
that runs axtChain to create the chains.
All of the scripts in kent/src/hg/utils/automation/ should be
installed somewhere in the PATH specified in your DEF file. I think
you have all of the required programs, but of course let us know if
you get any "command not found"s.
> 2. I made a DEF file like the following. Could you check if there is
> anything missed/wrong?
That looks very good -- there is only one thing I would add, because
the zebrafish assembly includes many small scaffold sequences. (If
that is not true of the new assembly, then never mind!)
SEQ1_LIMIT=30
SEQ2_LIMIT=100
That will keep the run time of your jobs a bit more balanced --
without SEQ1_LIMIT especially, some blastz jobs could run for a very
long time.
> 3. Luckily, I grep piece of code written by Hiram 3 years ago, in the
> ~/kent/src/hg/makeDb/doc/hg18.txt, which would help me a lot (I copied
> here, see below). But I have two questions to this:
> 1). Do I have to run the pipeline in the cluster? Can I just run it
> on a server, in case I don't have assess to a cluster? If so, how should
> I set parameters for the pipeline?
Oh wow. If you have only one server, it could take months to run all
the blastz jobs, assuming they run consecutively and don't have
errors. Whole-genome blastz of sizeable genomes really requires a
good compute cluster.
> 2). After running the doBlastzChainNet.pl, it seems you ssh to
> another machine ("ssh kolossus") and run the featureBits. What's the
> purpose for this? Do I have to include this part if I just want to make
> the selfChain data?
featureBits tells how many non-gap bases of the reference genome are
covered by at least one aligned block of the chains. You don't need
to do that unless you're interested in that statistic.
> 3). If I want to make a chainSelf table in MySQL (like the table in
> ucsc), what additional script I should run for that?
doBlastz includes that (search for hgLoadChain).
> 4. Last question, I noticed that in your description page of self Chain,
> it says to use a specific matrix for the dynamic program which was run
> over the kd-trees to find the maximally scoring chains of these blocks.
> But the matrix is not given in the tetraodon selfChain page. Can I know
> how I should set the matrix for making zebrafish selfChain? Or, this
> does not matter at all, for the integrated doBlastzChainNet.pl
> pipeline?
For self chains, we just use the default matrix.
> The reason to ask this question is, since the selfChain is
> mainly for detecting the paralog part in the genome, and most of those
> are from duplication (whole-genome duplication like in teleost file, or
> tandem duplication locally), the matrix for scoring the chain should
> somehow measure the distance from the split point (e.g. when WGD
> happened) to now in different genome. I guess this should be different
> for zebrafish (where the WGD happened 300-450 Mya) and human (where the
> 2R WGD happened much earlier). But how UCSC set the substitution(?)
> matrix, I have no clue. Like to hear option from you.
We don't have such a rigorous process, actually -- the results with
the default matrix are good enough for our purposes. If/when they're
not, we do some trial-and-error.
An aside about the default blastz matrix: it was tuned for human-mouse
(or mammal-mammal), and I think it is a bit less sensitive than it
could be for cross-species comparisons of genomes that don't have so
many repetitive sequences as mammalian genomes. I.e. with too high
sensitivity, mammalian genomes have so many matches of low-complexity
or incompletely masked repetitive sequences that the output is too
huge to store and display. When comparing genomes that are not so
repetitive (e.g. flies), we have been able to increase the sensitivity
without getting swamped by results. For cross-species involving
non-mammalian genomes, we often use a matrix that was tuned on the
HoxD55 region of many genomes by Webb Miller('s group?) some time in
2002 or earlier. Locally we call the file HoxD55.q and its contents
are these 5 lines:
A C G T
91 -90 -25 -100
-90 100 -100 -25
-25 -100 100 -90
-100 -25 -90 91
In addition to the matrix, blastz has several other parameters that
adjust sensitivity -- see the README.pdf that comes with the blastz
package.
I would suggest trying with default parameters, and then with more
sensitive parameters if you suspect there could be more results than
initially found. Of course, this experimentation really requires the
luxury of a big compute cluster!
Finally, there is another wiki page that may be helpful -- it was
written with cross-species alignments in mind, but it is a great
description of how someone was able to get a simplified version of our
pipeline running on his system for a small genome (150Mbase):
http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto
Hope that helps, and please send any further questions to
[email protected] -- we will reply more promptly next time!
Angie
> Sorry for too many questions :) Thanks for any help
>
> -Xianjun
>
> ============ sample code from ~/kent/src/hg/makeDb/doc/hg18.txt
> =================
>
> # BLASTZ SELF (DONE - 2006-01-17 - 2006-01-20 - Hiram)
>
> ssh pk
> mkdir /cluster/data/hg18/bed/blastzSelf.2006-01-17
> cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>
> # prepare the DEF file
>
> cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
> time /cluster/bin/scripts/doBlastzChainNet.pl -verbose=2 \
> -chainMinScore=10000 -chainLinearGap=medium -bigClusterHub=pk \
> `pwd`/DEF > blastz.out 2>&1 &
> # real 640m37.637s
>
> ssh kolossus
> cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
> time HGDB_CONF=~/.hg.conf.read-only featureBits \
> -noRandom -noHap hg18 chainSelfLink > fb.chainSelfLink 2>&1 &
> # real 21m52.697s
> # 324067552 bases of 2858034764 (11.339%) in intersection
>
>
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome