Re: [Genome] self-chain for zebrafish

Katrina Learned Wed, 24 Jun 2009 08:56:12 -0700

Hi Xianjun,

To find out which Ensembl version we are using , turn on the Ensembl 
Genes track in full mode and then click on any gene to view its details. 
At the top of the details page you will see a light blue title bar 
containing the name of the track, the gene name, and the Ensembl 
version. We are currently using version 52.


Please don't hesitate to contact the mail list again if you require 
further assistance.

Katrina Learned
UCSC Genome Bioinformatics Group



Xianjun Dong wrote:
> hi, Angie
>
> Can I know which version the Ensembl gene track in UCSC is from? I check 
> the ensGene table, there is no such version information there, nor the 
> ensPep, ensGtp tables.
>
> Thanks
>
> Xianjun
>
>
>
> Angie Hinrichs wrote:
>   
>> Hi Xianjun,
>>
>> Sorry about the delayed response.  Comments embedded:
>>
>>   
>>     
>>> 1. I installed kent's source tree, and the blastz program. I also made
>>> the .2bit file for zebrafish (Zv8), and the chromosome size file. What
>>> else I need to get the selfChain?
>>>     
>>>       
>> doBlastzChainNet.pl runs compute cluster batches using the parasol 
>> system.  The script creates files that list query and target sequences 
>> and a template file for generating command specs for cluster jobs.  
>> Then it calls the gensub2 program to generate the command specs, and 
>> invokes the parasol commands in HgAutomate.pm's $paraRun to run the 
>> cluster batch.  You probably have a different infrastructure for your 
>> local compute cluster, so all of that code will need to be adapted.
>>
>> There are 3 cluster batches: one very large cluster batch to run 
>> thousands of blat jobs on pairs of sequence chunks (for zebrafish, 
>> probably just one thousand jobs).  Then there is a smaller batch 
>> (~300) that concatenates results for each target sequence across all 
>> of the query sequence chunks, and another smaller cluster batch (~100) 
>> that runs axtChain to create the chains.
>>
>> All of the scripts in kent/src/hg/utils/automation/ should be 
>> installed somewhere in the PATH specified in your DEF file.  I think 
>> you have all of the required programs, but of course let us know if 
>> you get any "command not found"s.  
>>
>>
>>   
>>     
>>> 2. I made a DEF file like the following. Could you check if there is
>>> anything missed/wrong?
>>>     
>>>       
>> That looks very good -- there is only one thing I would add, because 
>> the zebrafish assembly includes many small scaffold sequences.  (If 
>> that is not true of the new assembly, then never mind!)  
>>
>> SEQ1_LIMIT=30
>>
>> SEQ2_LIMIT=100
>>
>> That will keep the run time of your jobs a bit more balanced -- 
>> without SEQ1_LIMIT especially, some blastz jobs could run for a very 
>> long time.  
>>
>>
>>   
>>     
>>> 3. Luckily, I grep piece of code written by Hiram 3 years ago, in the
>>> ~/kent/src/hg/makeDb/doc/hg18.txt, which would help me a lot (I copied
>>> here, see below). But I have two questions to this:
>>>     1). Do I have to run the pipeline in the cluster? Can I just run it
>>> on a server, in case I don't have assess to a cluster? If so, how should
>>> I set parameters for the pipeline?
>>>     
>>>       
>> Oh wow.  If you have only one server, it could take months to run all
>> the blastz jobs, assuming they run consecutively and don't have
>> errors.  Whole-genome blastz of sizeable genomes really requires a
>> good compute cluster.
>>
>>
>>   
>>     
>>>     2). After running the doBlastzChainNet.pl, it seems you ssh to
>>> another machine ("ssh kolossus") and run the featureBits. What's the
>>> purpose for this? Do I have to include this part if I just want to make
>>> the selfChain data?
>>>     
>>>       
>> featureBits tells how many non-gap bases of the reference genome are
>> covered by at least one aligned block of the chains.  You don't need
>> to do that unless you're interested in that statistic.
>>
>>
>>   
>>     
>>>     3). If I want to make a chainSelf table in MySQL (like the table in
>>> ucsc), what additional script I should run for that?
>>>     
>>>       
>> doBlastz includes that (search for hgLoadChain).
>>
>>
>>   
>>     
>>> 4. Last question, I noticed that in your description page of self Chain,
>>> it says to use a specific matrix for the dynamic program which was run
>>> over the kd-trees to find the maximally scoring chains of these blocks.
>>> But the matrix is not given in the tetraodon selfChain page. Can I know
>>> how I should set the matrix for making zebrafish selfChain? Or, this
>>> does not matter at all, for the integrated doBlastzChainNet.pl
>>> pipeline?  
>>>     
>>>       
>> For self chains, we just use the default matrix.  
>>
>>
>>   
>>     
>>> The reason to ask this question is, since the selfChain is
>>> mainly for detecting the paralog part in the genome, and most of those
>>> are from duplication (whole-genome duplication like in teleost file, or
>>> tandem duplication locally), the matrix for scoring the chain should
>>> somehow measure the distance from the split point (e.g. when WGD
>>> happened) to now in different genome. I guess this should be different
>>> for zebrafish (where the WGD happened 300-450 Mya)  and human (where the
>>> 2R WGD happened much earlier). But how UCSC set the substitution(?)
>>> matrix, I have no clue. Like to hear option from you.
>>>     
>>>       
>> We don't have such a rigorous process, actually -- the results with
>> the default matrix are good enough for our purposes.  If/when they're
>> not, we do some trial-and-error.  
>>
>> An aside about the default blastz matrix: it was tuned for human-mouse
>> (or mammal-mammal), and I think it is a bit less sensitive than it
>> could be for cross-species comparisons of genomes that don't have so
>> many repetitive sequences as mammalian genomes.  I.e. with too high
>> sensitivity, mammalian genomes have so many matches of low-complexity
>> or incompletely masked repetitive sequences that the output is too
>> huge to store and display.  When comparing genomes that are not so
>> repetitive (e.g. flies), we have been able to increase the sensitivity
>> without getting swamped by results.  For cross-species involving
>> non-mammalian genomes, we often use a matrix that was tuned on the
>> HoxD55 region of many genomes by Webb Miller('s group?) some time in
>> 2002 or earlier.  Locally we call the file HoxD55.q and its contents
>> are these 5 lines:
>>
>>      A    C    G    T
>>     91  -90  -25 -100
>>    -90  100 -100  -25
>>    -25 -100  100  -90
>>   -100  -25  -90  91
>>
>> In addition to the matrix, blastz has several other parameters that
>> adjust sensitivity -- see the README.pdf that comes with the blastz
>> package.
>>
>> I would suggest trying with default parameters, and then with more
>> sensitive parameters if you suspect there could be more results than
>> initially found.  Of course, this experimentation really requires the
>> luxury of a big compute cluster!  
>>
>> Finally, there is another wiki page that may be helpful -- it was
>> written with cross-species alignments in mind, but it is a great
>> description of how someone was able to get a simplified version of our
>> pipeline running on his system for a small genome (150Mbase):
>>
>> http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto
>>
>> Hope that helps, and please send any further questions to
>> [email protected] -- we will reply more promptly next time!  
>>
>> Angie
>>
>>   
>>     
>>> Sorry for too many questions :) Thanks for any help
>>>
>>> -Xianjun
>>>
>>> ============ sample code from ~/kent/src/hg/makeDb/doc/hg18.txt
>>> =================
>>>
>>> #  BLASTZ SELF (DONE - 2006-01-17 - 2006-01-20 - Hiram)
>>>
>>>     ssh pk
>>>     mkdir /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>>     cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>>
>>>   # prepare the DEF file
>>>
>>>   cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>>     time /cluster/bin/scripts/doBlastzChainNet.pl -verbose=2 \
>>>         -chainMinScore=10000 -chainLinearGap=medium -bigClusterHub=pk \
>>>         `pwd`/DEF > blastz.out 2>&1 &
>>>     #   real    640m37.637s
>>>
>>>     ssh kolossus
>>>     cd /cluster/data/hg18/bed/blastzSelf.2006-01-17
>>>     time HGDB_CONF=~/.hg.conf.read-only featureBits \
>>>         -noRandom -noHap hg18 chainSelfLink > fb.chainSelfLink 2>&1 &
>>>     #   real    21m52.697s
>>>     #   324067552 bases of 2858034764 (11.339%) in intersection
>>>
>>>
>>>     
>>>       
>
>   
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Re: [Genome] self-chain for zebrafish

Reply via email to