Re: [Genome] self-chain for zebrafish

Xianjun Dong Wed, 24 Jun 2009 08:26:20 -0700

hi, Angie

Can I know which version the Ensembl gene track in UCSC is from? I check 
the ensGene table, there is no such version information there, nor the 
ensPep, ensGtp tables.


Thanks

Xianjun



Angie Hinrichs wrote:
> Hi Xianjun,
>
> Sorry about the delayed response.  Comments embedded:
>
>   
>> 1. I installed kent's source tree, and the blastz program. I also made
>> the .2bit file for zebrafish (Zv8), and the chromosome size file. What
>> else I need to get the selfChain?
>>     
>
> doBlastzChainNet.pl runs compute cluster batches using the parasol 
> system.  The script creates files that list query and target sequences 
> and a template file for generating command specs for cluster jobs.  
> Then it calls the gensub2 program to generate the command specs, and 
> invokes the parasol commands in HgAutomate.pm's $paraRun to run the 
> cluster batch.  You probably have a different infrastructure for your 
> local compute cluster, so all of that code will need to be adapted.
>
> There are 3 cluster batches: one very large cluster batch to run 
> thousands of blat jobs on pairs of sequence chunks (for zebrafish, 
> probably just one thousand jobs).  Then there is a smaller batch 
> (~300) that concatenates results for each target sequence across all 
> of the query sequence chunks, and another smaller cluster batch (~100) 
> that runs axtChain to create the chains.
>
> All of the scripts in kent/src/hg/utils/automation/ should be 
> installed somewhere in the PATH specified in your DEF file.  I think 
> you have all of the required programs, but of course let us know if 
> you get any "command not found"s.  
>
>
>   
>> 2. I made a DEF file like the following. Could you check if there is
>> anything missed/wrong?
>>     
>
> That looks very good -- there is only one thing I would add, because 
> the zebrafish assembly includes many small scaffold sequences.  (If 
> that is not true of the new assembly, then never mind!)  
>
> SEQ1_LIMIT=30
>
> SEQ2_LIMIT=100
>
> That will keep the run time of your jobs a bit more balanced -- 
> without SEQ1_LIMIT especially, some blastz jobs could run for a very 
> long time.  
>
>
>   
>> 3. Luckily, I grep piece of code written by Hiram 3 years ago, in the
>> ~/kent/src/hg/makeDb/doc/hg18.txt, which would help me a lot (I copied
>> here, see below). But I have two questions to this:
>>     1). Do I have to run the pipeline in the cluster? Can I just run it
>> on a server, in case I don't have assess to a cluster? If so, how should
>> I set parameters for the pipeline?
>>     
>
> Oh wow.  If you have only one server, it could take months to run all
> the blastz jobs, assuming they run consecutively and don't have
> errors.  Whole-genome blastz of sizeable genomes really requires a
> good compute cluster.
>
>
>   
>>     2). After running the doBlastzChainNet.pl, it seems you ssh to
>> another machine ("ssh kolossus") and run the featureBits. What's the
>> purpose for this? Do I have to include this part if I just want to make
>> the selfChain data?
>>     
>
> featureBits tells how many non-gap bases of the reference genome are
> covered by at least one aligned block of the chains.  You don't need
> to do that unless you're interested in that statistic.
>
>
>   
>>     3). If I want to make a chainSelf table in MySQL (like the table in
>> ucsc), what additional script I should run for that?
>>     
>
> doBlastz includes that (search for hgLoadChain).
>
>
>   
>> 4. Last question, I noticed that in your description page of self Chain,
>> it says to use a specific matrix for the dynamic program which was run
>> over the kd-trees to find the maximally scoring chains of these blocks.
>> But the matrix is not given in the tetraodon selfChain page. Can I know
>> how I should set the matrix for making zebrafish selfChain? Or, this
>> does not matter at all, for the integrated doBlastzChainNet.pl
>> pipeline?  
>>     
>
> For self chains, we just use the default matrix.  
>
>
>   
>> The reason to ask this question is, since the selfChain is
>> mainly for detecting the paralog part in the genome, and most of those
>> are from duplication (whole-genome duplication like in teleost file, or
>> tandem duplication locally), the matrix for scoring the chain should
>> somehow measure the distance from the split point (e.g. when WGD
>> happened) to now in different genome. I guess this should be different
>> for zebrafish (where the WGD happened 300-450 Mya)  and human (where the
>> 2R WGD happened much earlier). But how UCSC set the substitution(?)
>> matrix, I have no clue. Like to hear option from you.
>>     
>
> We don't have such a rigorous process, actually -- the results with
> the default matrix are good enough for our purposes.  If/when they're
> not, we do some trial-and-error.  
>
> An aside about the default blastz matrix: it was tuned for human-mouse
> (or mammal-mammal), and I think it is a bit less sensitive than it
> could be for cross-species comparisons of genomes that don't have so
> many repetitive sequences as mammalian genomes.  I.e. with too high
> sensitivity, mammalian genomes have so many matches of low-complexity
> or incompletely masked repetitive sequences that the output is too
> huge to store and display.  When comparing genomes that are not so
> repetitive (e.g. flies), we have been able to increase the sensitivity
> without getting swamped by results.  For cross-species involving
> non-mammalian genomes, we often use a matrix that was tuned on the
> HoxD55 region of many genomes by Webb Miller('s group?) some time in
> 2002 or earlier.  Locally we call the file HoxD55.q and its contents
> are these 5 lines:
>
>      A    C    G    T
>     91  -90  -25 -100
>    -90  100 -100  -25
>    -25 -100  100  -90
>   -100  -25  -90  91
>
> In addition to the matrix, blastz has several other parameters that
> adjust sensitivity -- see the README.pdf that comes with the blastz
> package.
>
> I would suggest trying with default parameters, and then with more
> sensitive parameters if you suspect there could be more results than
> initially found.  Of course, this experimentation really requires the
> luxury of a big compute cluster!  
>
> Finally, there is another wiki page that may be helpful -- it was
> written with cross-species alignments in mind, but it is a great
> description of how someone was able to get a simplified version of our
> pipeline running on his system for a small genome (150Mbase):
>
> http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto
>
> Hope that helps, and please send any further questions to
> [email protected] -- we will reply more promptly next time!  
>
> Angie
>
>   
>> Sorry for too many questions :) Thanks for any help
>>
>> -Xianjun
>>
>> ============ sample code from ~/kent/src/hg/makeDb/doc/hg18.txt
>> =================
>>
>> #  BLASTZ SELF (DONE - 2006-01-17 - 2006-01-20 - Hiram)
>>
>>     ssh pk
>>     mkdir /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>     cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>
>>   # prepare the DEF file
>>
>>   cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>     time /cluster/bin/scripts/doBlastzChainNet.pl -verbose=2 \
>>         -chainMinScore=10000 -chainLinearGap=medium -bigClusterHub=pk \
>>         `pwd`/DEF > blastz.out 2>&1 &
>>     #   real    640m37.637s
>>
>>     ssh kolossus
>>     cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>     time HGDB_CONF=~/.hg.conf.read-only featureBits \
>>         -noRandom -noHap hg18 chainSelfLink > fb.chainSelfLink 2>&1 &
>>     #   real    21m52.697s
>>     #   324067552 bases of 2858034764 (11.339%) in intersection
>>
>>
>>     

-- 
==========================================
Xianjun Dong
PhD student, Lenhard group
Computational Biology Unit
Bergen Center for Computational Science
University of Bergen
Hoyteknologisenteret, Thormohlensgate 55
N-5008 Bergen, Norway
E-mail: [email protected]
Tel.: +47 555 84022
Fax : +47 555 84295
==========================================

_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Re: [Genome] self-chain for zebrafish

Reply via email to