Hi Jennifer,

many thanks for the detailed explanations. I think that I now understand how it works and where to be careful. We'll run some tests on the specific species and regions we're dealing with.
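For those tests, the multi-pass idea from your reply (strict parameters first, then more permissive ones, ranking results by how strict the pass was) could be scripted roughly as follows. This is only a sketch under assumptions: the function names, file names, and the middle threshold of 0.5 are made up for illustration, and it assumes the unmapped file marks each rejected record with a reason line starting with "#". Only the -minMatch flag and the usual "liftOver oldFile map.chain newFile unMapped" argument order are taken from the tool itself.

```python
import subprocess
from pathlib import Path

# Thresholds from strict to permissive. 0.9 and 0.1 are the same-species and
# cross-species values quoted in this thread; 0.5 is an arbitrary middle step.
THRESHOLDS = (0.9, 0.5, 0.1)


def liftover_cmd(bed_in, chain, mapped, unmapped, min_match):
    """Build a liftOver call: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", str(bed_in), str(chain), str(mapped), str(unmapped),
            f"-minMatch={min_match}"]


def drop_comments(text):
    """Remove '#...' reason lines (assumed format of the unmapped file)."""
    kept = [ln for ln in text.splitlines()
            if ln.strip() and not ln.startswith("#")]
    return "\n".join(kept) + ("\n" if kept else "")


def multipass(bed_in, chain, workdir, thresholds=THRESHOLDS, run=subprocess.run):
    """Lift regions in several passes, retrying leftovers more permissively.

    Returns (min_match, mapped_file) pairs, strictest first, so each lifted
    region can be ranked by the threshold at which it first mapped.
    """
    workdir = Path(workdir)
    current = Path(bed_in)
    results = []
    for i, min_match in enumerate(thresholds, start=1):
        mapped = workdir / f"pass{i}.mapped.bed"
        unmapped = workdir / f"pass{i}.unmapped.bed"
        run(liftover_cmd(current, chain, mapped, unmapped, min_match), check=True)
        results.append((min_match, mapped))
        leftovers = drop_comments(unmapped.read_text())
        if not leftovers:
            break  # everything mapped; no further pass needed
        current = workdir / f"pass{i}.retry.bed"
        current.write_text(leftovers)
    return results
```

The "run" argument is only there so the loop can be exercised without the liftOver binary; in real use the default subprocess.run would call the tool. Regions that map only in a later pass should be treated with correspondingly more suspicion.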
I have to admit though that I didn't understand your comment about the Conservation track: looking e.g. at Drosophila melanogaster, the Conservation track (full view) contains the phastCons track and a graphical representation of the alignment blocks. In the Table Browser, the Conservation track contains the tables phastCons, multiz15waySummary, multiz15wayFrames, and multiz15way. I'm not sure how I would use them for an automated genome-wide homology mapping between the species (with all the caveats regarding homology/orthology you mentioned). Or did you recommend this only for a manual analysis of a few regions?

Many thanks and best wishes,
Alex

On Aug 21, 2009, at 8:08 PM, Jennifer Jackson wrote:

> Hi Alex,
>
> Using liftOver for cross-species comparisons can be tricky. For
> each pair, it may take a few passes to get all alignments, making
> the parameters more permissive, and perhaps ranking the results
> based on how strict the parameters were.
>
> For same species, our defaults are minmatch=0.9 and multiple=N.
> This means that there must be at least a 90% identity to a single
> location in order for the mapping to work.
>
> For cross-species, our defaults are minmatch=0.1 and multiple=Y.
> This means that there must be at least a 10% identity and multiple
> locations may be reported.
>
> The other parameters that you mention are defined - there is not
> much to add. Increasing the "chain" length for either query or
> target will filter out shorter alignments, etc.
>
> The files that liftOver is based on (chains) are not evaluated to
> be confirmed syntenic regions. They are only ranked by probability
> (see the chain file documentation in the link from Kayla). And the
> genes associated with these regions are certainly not confirmed
> orthologs.
> They could be paralogs or homologues or orthologs. But as
> you know, ortholog implies similar function, which is a more
> complicated question and is frequently not a 1-1 relationship nor
> identifiable by sequence homology for more distant species. So, you
> are correct to be suspicious of the "more hits" you are getting.
>
> For a different view that incorporates more factors (evolutionary
> distance, for example) use the Conservation track. It is better
> suited for cross-species syntenic analysis. Then, if you find
> sequences that you believe are orthologs, and they also have strong
> syntenic evidence, that would give you perhaps more confidence in
> that ortholog for evolutionarily close species. It is not true that
> all true orthologs will be syntenic, especially as the evolutionary
> distance increases.
>
> Sorry that there is no exact answer for your question. The liftOver
> utility is just a tool. Experimenting with parameters and then
> evaluating the results from a scientific perspective is the
> recommended analysis path. And each species pair - and even each
> genomic location - may need special handling. LiftOver is explicitly
> not recommended for cross-species comparisons in the documentation,
> but it is used anyway by many for a first look, especially for
> species with close evolutionary distance.
>
> Good luck,
> Jennifer Jackson
>
> ------------------------------------------------
> Jennifer Jackson
> UCSC Genome Bioinformatics Group
>
> ----- "Alexander Stark" <[email protected]> wrote:
>
>> From: "Alexander Stark" <[email protected]>
>> To: "Kayla Smith" <[email protected]>
>> Cc: [email protected]
>> Sent: Friday, August 21, 2009 7:22:12 AM GMT -08:00 US/Canada Pacific
>> Subject: Re: [Genome] (advanced) liftOver questions
>>
>> Dear Kayla,
>>
>> thank you very much for your quick response, the explanations and the
>> links. They were very helpful in understanding your pipeline and the
>> concepts of chains and nets.
>>
>> However, I didn't find an explanation about the command-line
>> parameters of liftOver and how they affect the results. I'd be
>> extremely happy if someone could find some time to check my
>> problems/questions.
>>
>> Very many thanks in advance,
>>
>> Alex
>>
>> On Aug 21, 2009, at 1:12 AM, Kayla Smith wrote:
>>
>>> Hello Alexander,
>>>
>>> Please see this previously answered mailing list question, "How does
>>> LiftOver Work?":
>>>
>>> https://lists.soe.ucsc.edu/pipermail/genome/2008-March/015810.html
>>>
>>> Hopefully this should answer some of your questions. If you still
>>> require assistance, please write back to us.
>>>
>>> Kayla Smith
>>> UCSC Genome Bioinformatics Group
>>>
>>> ----- "Alexander Stark" <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> we're using liftOver quite extensively to translate coordinates
>>>> between different species. In general, it seems to work quite well
>>>> for us and the results typically make sense when we inspect them
>>>> visually.
>>>>
>>>> However, sometimes we run into problems, especially for
>>>> coordinate-conversions between more distantly related species.
>>>> Unfortunately, we could not find a more detailed description of how
>>>> liftOver works (apart from the short help it prints) and what the
>>>> command line parameters do - we hope someone can help.
>>>>
>>>> It is our understanding that liftOver essentially uses the UCSC
>>>> alignments (or the underlying data) for the conversions. This should
>>>> mean that any input region can map to 0, 1, or several contiguous
>>>> regions in the target genome, that the region length can change, and
>>>> that only a certain fraction of the input nucleotides correspond to
>>>> (i.e. map to) target nucleotides.
>>>>
>>>> We assume that the behavior of liftOver with respect to these can be
>>>> controlled using the following parameters:
>>>>
>>>> -minMatch=0.N  Minimum ratio of bases that must remap. Default 0.95
>>>> -minBlocks=0.N Minimum ratio of alignment blocks/exons that must map
>>>>                (default 1.00)
>>>> -fudgeThick    If thickStart/thickEnd is not mapped, use the closest
>>>>                mapped base. Recommended if using -minBlocks.
>>>> -multiple      Allow multiple output regions
>>>> -minChainT, -minChainQ Minimum chain size in target/query, when
>>>>                mapping to multiple output regions (default 0, 0)
>>>>
>>>> Could you please give some details on what exactly the parameters do?
>>>> This is very important for us to know in order to use the tool
>>>> appropriately. For example:
>>>>
>>>> 1. What does "remap" mean for the minMatch parameter?
>>>> Is it the fraction of input bases that have a target counterpart,
>>>> i.e. that would appear aligned in a sequence alignment (or is it the
>>>> fraction of target bases that have an input counterpart)?
>>>>
>>>> When relaxing this parameter, we typically get more lifted regions.
>>>> Are these however still orthologous/unique or will we run into a
>>>> specificity problem? I understand that liftOver only uses a
>>>> pre-computed alignment (or coordinate lookup-table) that - in
>>>> principle - only contains alignments between orthologous regions. In
>>>> other words, I do NOT expect liftOver to simply find more and more
>>>> "matches" that make less and less sense as e.g. blast would do when
>>>> lowering its specificity.
>>>>
>>>> 2. How does the minMatch parameter influence the growing & shrinking
>>>> of region-length?
>>>> Does a more relaxed minMatch parameter allow for more variable
>>>> region-length between input and target regions?
>>>> In other words: if it only assesses the fraction of input
>>>> nucleotides that have a counterpart, the region can grow freely but
>>>> not shrink and vice versa.
>>>>
>>>> 3. Will we "lose" regions?
>>>> When lowering minMatch, will regions that are uniquely mapped with a
>>>> stringent minMatch parameter map to multiple regions/blocks and thus
>>>> become unmapped?
>>>>
>>>> 4. Does "multiple" allow that an input region spans multiple output
>>>> blocks or does it allow non-unique mapping (of the same region)?
>>>>
>>>> 5. What do minChainT and minChainQ mean (i.e. what is a chain size,
>>>> etc.)?
>>>>
>>>> 6. What does minBlocks do? Does it apply to regions that span
>>>> multiple alignment blocks and require that the same number of
>>>> alignment blocks must be in the input and target region?
>>>>
>>>> Very many thanks for your help in advance and sorry for all the
>>>> questions.
>>>>
>>>> Best,
>>>>
>>>> Alex
>>>>
>>>> **********
>>>> Alexander Stark, PhD
>>>> Group Leader
>>>> Institute of Molecular Pathology (IMP)
>>>> Dr. Bohr-Gasse 7; 1030 Vienna
>>>> Austria
>>>>
>>>> Tel. +43 (1) 79730-3380
>>>> [email protected]
>>>> http://www.imp.ac.at/research/alexander-stark/
>>>>
>>>> _______________________________________________
>>>> Genome maillist - [email protected]
>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
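
Questions 1 and 2 above turn on one reading of minMatch: the fraction of input bases that fall inside aligned blocks. A toy model of that reading, and of why it would not constrain the length of the output region, might look like this. It is not liftOver's implementation, and the thread does not confirm that liftOver computes the ratio this way. The function name and the block representation (ungapped blocks on the same strand, given as input start, input end, target start with half-open coordinates) are assumptions made for illustration.

```python
def remap_summary(start, end, blocks):
    """Return (fraction of input bases in aligned blocks, target span).

    blocks: iterable of (q_start, q_end, t_start) ungapped aligned blocks,
    half-open coordinates, same strand. Hypothetical toy representation.
    """
    covered = 0
    t_lo = t_hi = None
    for q_start, q_end, t_start in blocks:
        # Clip the block to the input region.
        lo, hi = max(start, q_start), min(end, q_end)
        if lo >= hi:
            continue  # block does not overlap the input region
        covered += hi - lo
        offset = t_start - q_start  # shift from input to target coordinates
        t_lo = lo + offset if t_lo is None else min(t_lo, lo + offset)
        t_hi = hi + offset if t_hi is None else max(t_hi, hi + offset)
    ratio = covered / (end - start)
    span = None if t_lo is None else (t_lo, t_hi)
    return ratio, span


# A 100-base input region: 80 bases sit in two aligned blocks, but the
# blocks land far apart in the target, so the output region is 330 bases.
ratio, span = remap_summary(100, 200, [(100, 150, 1000), (170, 200, 1300)])
print(ratio, span)
```

Under this reading, the example region would pass a threshold of 0.8 and fail one of 0.9, while the target span is more than three times the input length. That is the asymmetry question 2 describes: a ratio measured on input bases alone says nothing about how much the region grows in the target.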
