hi, Angie
Can I know which version the Ensembl gene track in UCSC is from? I check
the ensGene table, there is no such version information there, nor the
ensPep, ensGtp tables.
Thanks
Xianjun
Angie Hinrichs wrote:
> Hi Xianjun,
>
> Sorry about the delayed response. Comments embedded:
>
>
>> 1. I installed kent's source tree, and the blastz program. I also made
>> the .2bit file for zebrafish (Zv8), and the chromosome size file. What
>> else I need to get the selfChain?
>>
>
> doBlastzChainNet.pl runs compute cluster batches using the parasol
> system. The script creates files that list query and target sequences
> and a template file for generating command specs for cluster jobs.
> Then it calls the gensub2 program to generate the command specs, and
> invokes the parasol commands in HgAutomate.pm's $paraRun to run the
> cluster batch. You probably have a different infrastructure for your
> local compute cluster, so all of that code will need to be adapted.
>
> There are 3 cluster batches: one very large cluster batch to run
> thousands of blat jobs on pairs of sequence chunks (for zebrafish,
> probably just one thousand jobs). Then there is a smaller batch
> (~300) that concatenates results for each target sequence across all
> of the query sequence chunks, and another smaller cluster batch (~100)
> that runs axtChain to create the chains.
>
> All of the scripts in kent/src/hg/utils/automation/ should be
> installed somewhere in the PATH specified in your DEF file. I think
> you have all of the required programs, but of course let us know if
> you get any "command not found"s.
>
>
>
>> 2. I made a DEF file like the following. Could you check if there is
>> anything missed/wrong?
>>
>
> That looks very good -- there is only one thing I would add, because
> the zebrafish assembly includes many small scaffold sequences. (If
> that is not true of the new assembly, then never mind!)
>
> SEQ1_LIMIT=30
>
> SEQ2_LIMIT=100
>
> That will keep the run time of your jobs a bit more balanced --
> without SEQ1_LIMIT especially, some blastz jobs could run for a very
> long time.
>
>
>
>> 3. Luckily, I grep piece of code written by Hiram 3 years ago, in the
>> ~/kent/src/hg/makeDb/doc/hg18.txt, which would help me a lot (I copied
>> here, see below). But I have two questions to this:
>> 1). Do I have to run the pipeline in the cluster? Can I just run it
>> on a server, in case I don't have assess to a cluster? If so, how should
>> I set parameters for the pipeline?
>>
>
> Oh wow. If you have only one server, it could take months to run all
> the blastz jobs, assuming they run consecutively and don't have
> errors. Whole-genome blastz of sizeable genomes really requires a
> good compute cluster.
>
>
>
>> 2). After running the doBlastzChainNet.pl, it seems you ssh to
>> another machine ("ssh kolossus") and run the featureBits. What's the
>> purpose for this? Do I have to include this part if I just want to make
>> the selfChain data?
>>
>
> featureBits tells how many non-gap bases of the reference genome are
> covered by at least one aligned block of the chains. You don't need
> to do that unless you're interested in that statistic.
>
>
>
>> 3). If I want to make a chainSelf table in MySQL (like the table in
>> ucsc), what additional script I should run for that?
>>
>
> doBlastz includes that (search for hgLoadChain).
>
>
>
>> 4. Last question, I noticed that in your description page of self Chain,
>> it says to use a specific matrix for the dynamic program which was run
>> over the kd-trees to find the maximally scoring chains of these blocks.
>> But the matrix is not given in the tetraodon selfChain page. Can I know
>> how I should set the matrix for making zebrafish selfChain? Or, this
>> does not matter at all, for the integrated doBlastzChainNet.pl
>> pipeline?
>>
>
> For self chains, we just use the default matrix.
>
>
>
>> The reason to ask this question is, since the selfChain is
>> mainly for detecting the paralog part in the genome, and most of those
>> are from duplication (whole-genome duplication like in teleost file, or
>> tandem duplication locally), the matrix for scoring the chain should
>> somehow measure the distance from the split point (e.g. when WGD
>> happened) to now in different genome. I guess this should be different
>> for zebrafish (where the WGD happened 300-450 Mya) and human (where the
>> 2R WGD happened much earlier). But how UCSC set the substitution(?)
>> matrix, I have no clue. Like to hear option from you.
>>
>
> We don't have such a rigorous process, actually -- the results with
> the default matrix are good enough for our purposes. If/when they're
> not, we do some trial-and-error.
>
> An aside about the default blastz matrix: it was tuned for human-mouse
> (or mammal-mammal), and I think it is a bit less sensitive than it
> could be for cross-species comparisons of genomes that don't have so
> many repetitive sequences as mammalian genomes. I.e. with too high
> sensitivity, mammalian genomes have so many matches of low-complexity
> or incompletely masked repetitive sequences that the output is too
> huge to store and display. When comparing genomes that are not so
> repetitive (e.g. flies), we have been able to increase the sensitivity
> without getting swamped by results. For cross-species involving
> non-mammalian genomes, we often use a matrix that was tuned on the
> HoxD55 region of many genomes by Webb Miller('s group?) some time in
> 2002 or earlier. Locally we call the file HoxD55.q and its contents
> are these 5 lines:
>
> A C G T
> 91 -90 -25 -100
> -90 100 -100 -25
> -25 -100 100 -90
> -100 -25 -90 91
>
> In addition to the matrix, blastz has several other parameters that
> adjust sensitivity -- see the README.pdf that comes with the blastz
> package.
>
> I would suggest trying with default parameters, and then with more
> sensitive parameters if you suspect there could be more results than
> initially found. Of course, this experimentation really requires the
> luxury of a big compute cluster!
>
> Finally, there is another wiki page that may be helpful -- it was
> written with cross-species alignments in mind, but it is a great
> description of how someone was able to get a simplified version of our
> pipeline running on his system for a small genome (150Mbase):
>
> http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto
>
> Hope that helps, and please send any further questions to
> [email protected] -- we will reply more promptly next time!
>
> Angie
>
>
>> Sorry for too many questions :) Thanks for any help
>>
>> -Xianjun
>>
>> ============ sample code from ~/kent/src/hg/makeDb/doc/hg18.txt
>> =================
>>
>> # BLASTZ SELF (DONE - 2006-01-17 - 2006-01-20 - Hiram)
>>
>> ssh pk
>> mkdir /cluster/data/hg18/bed/blastzSelf.2006-01-17
>> cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>
>> # prepare the DEF file
>>
>> cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>> time /cluster/bin/scripts/doBlastzChainNet.pl -verbose=2 \
>> -chainMinScore=10000 -chainLinearGap=medium -bigClusterHub=pk \
>> `pwd`/DEF > blastz.out 2>&1 &
>> # real 640m37.637s
>>
>> ssh kolossus
>> cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>> time HGDB_CONF=~/.hg.conf.read-only featureBits \
>> -noRandom -noHap hg18 chainSelfLink > fb.chainSelfLink 2>&1 &
>> # real 21m52.697s
>> # 324067552 bases of 2858034764 (11.339%) in intersection
>>
>>
>>
--
==========================================
Xianjun Dong
PhD student, Lenhard group
Computational Biology Unit
Bergen Center for Computational Science
University of Bergen
Hoyteknologisenteret, Thormohlensgate 55
N-5008 Bergen, Norway
E-mail: [email protected]
Tel.: +47 555 84022
Fax : +47 555 84295
==========================================
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome