Hello Alex,

You are correct - the Conservation track is just one input source of sequence homology, based on genomic sequence. I suggested it because it may provide some confidence if many of the included species (or those known to be evolutionarily closer) are conserved in the target gene regions.

Chains and Nets are also an important information source. And linking back through the Most Conserved track to each native genome and locating genes in these regions might work for a certain number of cases. From what I remember, flies can be difficult because there is sometimes a significant amount of variation between two "close" species, but this is probably already a consideration you are aware of. This would lead to a large problem with using liftOver (many false negatives - the genomic sequence homology is just not strong enough).

It would take some work to develop a pipeline to identify ortholog candidates, and probably some manual review, depending on how error tolerant your experiment is. If you want definitive results, even very large data sets can be reviewed/curated by people fairly quickly if the comparison work/probability ranking is done up front through an analysis pipeline and low-interest genes and/or genes with repetitive segments or variable elements are omitted (IG factors & zinc fingers come to mind - I am sure you can think of others).

Comparing whole genomes between two species in one go would be tough. Perhaps doing the analysis per functional group with multi-species input would make more sense. The TransMap track in human might give you some ideas (it uses chains and nets, but toward the same goal: orthologs). The Pol Sel track is also pretty interesting. Both are human tracks, but perhaps their methods and associated papers would help you make some decisions for your own pipeline.

Some ideas, hope some of it helps!
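As a rough illustration of the multi-pass idea from earlier in this thread (run liftOver several times with progressively more permissive -minMatch values, then rank each region by the strictest setting at which it mapped), here is a minimal Python sketch. The helper functions, file names, and region names are hypothetical. The argument order (oldFile map.chain newFile unMapped) is as I recall it from the liftOver usage message, so please check it against your own build. Parsing the mapped region names out of each output file is left out.

```python
from typing import Dict, Iterable, List


def build_liftover_command(in_bed: str, chain: str, out_bed: str,
                           unmapped: str, min_match: float,
                           multiple: bool = False) -> List[str]:
    """Build one liftOver invocation (hypothetical helper, not part of liftOver).

    Assumed argument order: liftOver oldFile map.chain newFile unMapped [options]
    """
    cmd = ["liftOver", in_bed, chain, out_bed, unmapped,
           f"-minMatch={min_match}"]
    if multiple:
        cmd.append("-multiple")
    return cmd


def rank_by_strictest_pass(
        mapped_by_threshold: Dict[float, Iterable[str]]) -> Dict[str, float]:
    """For each region name, record the highest minMatch at which it mapped.

    mapped_by_threshold maps a minMatch value to the region names that were
    lifted successfully in that pass.
    """
    best: Dict[str, float] = {}
    # Visit the strictest pass first so setdefault keeps the highest threshold.
    for threshold in sorted(mapped_by_threshold, reverse=True):
        for name in mapped_by_threshold[threshold]:
            best.setdefault(name, threshold)
    return best


if __name__ == "__main__":
    # Hypothetical file names; each command could be run with subprocess.run(cmd).
    for mm in (0.9, 0.5, 0.1):
        print(" ".join(build_liftover_command(
            "regions.bed", "a_to_b.over.chain",
            f"lifted_{mm}.bed", f"unmapped_{mm}.bed", mm)))

    # Hypothetical per-pass results, as if parsed from the lifted_*.bed files.
    passes = {0.9: ["r1"], 0.5: ["r1", "r2"], 0.1: ["r1", "r2", "r3"]}
    print(rank_by_strictest_pass(passes))
```

Regions that map only at the most permissive setting would then be the first candidates for manual review.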
Best wishes,
Jennifer

------------------------------------------------
Jennifer Jackson
UCSC Genome Bioinformatics Group

----- "Alexander Stark" <[email protected]> wrote:

> From: "Alexander Stark" <[email protected]>
> To: "Jennifer Jackson" <[email protected]>
> Cc: [email protected], "Kayla Smith" <[email protected]>
> Sent: Saturday, August 22, 2009 1:21:58 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [Genome] (advanced) liftOver questions
>
> Hi Jennifer,
>
> many thanks for the detailed explanations, I think that I now
> understand how it works and where to be careful. We'll run some tests
> on the specific species and regions we're dealing with.
>
> I have to admit though that I didn't understand your comment about the
> Conservation track: looking e.g. at Drosophila melanogaster, the
> Conservation track (full view) contains the phastCons track and a
> graphical representation of the alignment blocks. In the Table
> Browser, the Conservation track contains the tables: phastCons,
> multiz15waySummary, multiz15wayFrames, and multiz15way. I'm not sure
> how I would use them for an automated genome-wide homology mapping
> between the species (with all the caveats regarding homology/orthology
> you mentioned). Or did you recommend this only for a manual
> analysis of a few regions anyway?
>
> Many thanks and best wishes,
>
> Alex
>
> On Aug 21, 2009, at 8:08 PM, Jennifer Jackson wrote:
>
> > Hi Alex,
> >
> > Using liftOver for cross-species comparisons can be tricky. For
> > each pair, it may take a few passes to get all alignments, making
> > the parameters more permissive, and perhaps ranking the results
> > based on how strict the parameters were.
> >
> > For same species, our defaults are minmatch=0.9 and multiple=N.
> > This means that there must be at least a 90% identity to a single
> > location in order for the mapping to work.
> >
> > For cross-species, our defaults are minmatch=0.1 and multiple=Y.
> > This means that there must be at least a 10% identity and multiple
> > locations may be reported.
> >
> > The other parameters that you mention are defined - there is not
> > much to add. Increasing the "chain" length for either query or
> > target will filter out shorter alignments, etc.
> >
> > The files that liftOver is based on (chains) are not evaluated to
> > be confirmed syntenic regions. They are only ranked by probability
> > (see the chain file documentation in the link from Kayla). And the
> > genes associated with these regions are certainly not confirmed
> > orthologs. They could be paralogs or homologs or orthologs. But as
> > you know, ortholog implies similar function, which is a more
> > complicated question and is frequently not a 1-1 relationship nor
> > identifiable by sequence homology for more distant species. So, you
> > are correct to be suspicious of the "more hits" you are getting.
> >
> > For a different view that incorporates more factors (evolutionary
> > distance, for example) use the Conservation track. It is better
> > suited for cross-species syntenic analysis. Then, if you find
> > sequences that you believe are orthologs, and they also have strong
> > syntenic evidence, that would give you perhaps more confidence in
> > that ortholog for evolutionarily close species. Not all true
> > orthologs will be syntenic, especially as the evolutionary distance
> > increases.
> >
> > Sorry that there is no exact answer for your question. The liftOver
> > utility is just a tool. Experimenting with parameters and then
> > evaluating the results from a scientific perspective is the
> > recommended analysis path. And each species pair - and even each
> > genomic location - may need special handling. LiftOver is explicitly
> > not recommended for cross-species comparisons in the documentation,
> > but it is used anyway by many for a first look, especially for
> > species with close evolutionary distance.
> >
> > Good luck,
> > Jennifer Jackson
> >
> > ------------------------------------------------
> > Jennifer Jackson
> > UCSC Genome Bioinformatics Group
> >
> > ----- "Alexander Stark" <[email protected]> wrote:
> >
> >> From: "Alexander Stark" <[email protected]>
> >> To: "Kayla Smith" <[email protected]>
> >> Cc: [email protected]
> >> Sent: Friday, August 21, 2009 7:22:12 AM GMT -08:00 US/Canada Pacific
> >> Subject: Re: [Genome] (advanced) liftOver questions
> >>
> >> Dear Kayla,
> >>
> >> thank you very much for your quick response, the explanations and the
> >> links. They were very helpful in understanding your pipeline and the
> >> concepts of chains and nets.
> >>
> >> However, I didn't find an explanation about the command-line
> >> parameters of liftOver and how they affect the results. I'd be
> >> extremely happy if someone could find some time to check my problems/
> >> questions.
> >>
> >> Very many thanks in advance,
> >>
> >> Alex
> >>
> >> On Aug 21, 2009, at 1:12 AM, Kayla Smith wrote:
> >>
> >>> Hello Alexander,
> >>>
> >>> Please see this previously answered mailing list question, "How does
> >>> LiftOver Work?":
> >>>
> >>> https://lists.soe.ucsc.edu/pipermail/genome/2008-March/015810.html
> >>>
> >>> Hopefully this should answer some of your questions. If you still
> >>> require assistance, please write back to us.
> >>>
> >>> Kayla Smith
> >>> UCSC Genome Bioinformatics Group
> >>>
> >>> ----- "Alexander Stark" <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> we're using liftOver quite extensively to translate coordinates
> >>>> between different species. In general, it seems to work quite well
> >>>> for us and the results typically make sense when we inspect them
> >>>> visually.
> >>>>
> >>>> However, sometimes we run into problems, especially for
> >>>> coordinate-conversions between more distantly related species.
> >>>> Unfortunately, we could not find a more detailed description of how
> >>>> liftOver works (apart from the short help it prints) and what the
> >>>> command line parameters do - we hope someone can help.
> >>>>
> >>>> It is our understanding that liftOver essentially uses the UCSC
> >>>> alignments (or the underlying data) for the conversions. This should
> >>>> mean that any input region can map to 0, 1, or several contiguous
> >>>> regions in the target genome, that the region length can change, and
> >>>> that only a certain fraction of the input nucleotides correspond to
> >>>> (i.e. map to) target nucleotides.
> >>>>
> >>>> We assume that the behavior of liftOver with respect to these can be
> >>>> controlled using the following parameters:
> >>>>
> >>>> -minMatch=0.N Minimum ratio of bases that must remap. Default 0.95
> >>>> -minBlocks=0.N Minimum ratio of alignment blocks/exons that must map
> >>>>   (default 1.00)
> >>>> -fudgeThick If thickStart/thickEnd is not mapped, use the closest
> >>>>   mapped base. Recommended if using -minBlocks.
> >>>> -multiple Allow multiple output regions
> >>>> -minChainT, -minChainQ Minimum chain size in target/query, when
> >>>>   mapping to multiple output regions (default 0, 0)
> >>>>
> >>>> Could you please give some details on what exactly the parameters do?
> >>>> This is very important for us to know in order to use the tool
> >>>> appropriately. For example:
> >>>>
> >>>> 1. What does "remap" mean for the minMatch parameter?
> >>>> Is it the fraction of input bases that have a target counterpart,
> >>>> i.e. that would appear aligned in a sequence alignment (or is it the
> >>>> fraction of target-bases that have an input counterpart)?
> >>>>
> >>>> When relaxing this parameter, we typically get more lifted regions.
> >>>> Are these however still orthologous/unique or will we run into a
> >>>> specificity problem? I understand that liftOver only uses a
> >>>> pre-computed alignment (or coordinate lookup-table) that - in
> >>>> principle - only contains alignments between orthologous regions. In
> >>>> other words, I do NOT expect liftOver to simply find more and more
> >>>> "matches" that make less and less sense as e.g. blast would do when
> >>>> lowering its specificity.
> >>>>
> >>>> 2. How does the minMatch parameter influence the growing & shrinking
> >>>> of region-length?
> >>>> Does a more relaxed minMatch parameter allow for more variable
> >>>> region-length between input and target regions? In other words: if it
> >>>> only assesses the fraction of input nucleotides that have a
> >>>> counterpart, the region can grow freely but not shrink and vice versa.
> >>>>
> >>>> 3. Will we "lose" regions?
> >>>> When lowering minMatch, will regions that are uniquely mapped with a
> >>>> stringent minMatch parameter map to multiple regions/blocks and thus
> >>>> become unmapped?
> >>>>
> >>>> 4. Does "multiple" allow that an input region spans multiple output
> >>>> blocks or does it allow non-unique mapping (of the same region)?
> >>>>
> >>>> 5. What do minChainT and minChainQ mean (i.e. what is a chain size,
> >>>> etc.)?
> >>>>
> >>>> 6. What does minBlocks do? Does it apply to regions that span multiple
> >>>> alignment blocks and require that the same number of alignment blocks
> >>>> must be in the input and target region?
> >>>>
> >>>> Very many thanks for your help in advance and sorry for all the
> >>>> questions.
> >>>>
> >>>> Best,
> >>>>
> >>>> Alex
> >>>>
> >>>> **********
> >>>> Alexander Stark, PhD
> >>>> Group Leader
> >>>> Institute of Molecular Pathology (IMP)
> >>>> Dr. Bohr-Gasse 7; 1030 Vienna
> >>>> Austria
> >>>>
> >>>> Tel. +43 (1) 79730-3380
> >>>> [email protected]
> >>>> http://www.imp.ac.at/research/alexander-stark/
> >>>>
> >>>> _______________________________________________
> >>>> Genome maillist - [email protected]
> >>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
