Looks pretty good. One issue during the progressive alignment build up: 3D structure alignments can increase the reliability of the sequence alignments, particularly if the sequences are only distantly related. Having a way to incorporate the 3D structure info would be nice...
Andreas On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto <[email protected]> wrote: > Hi Andreas, > > On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <[email protected]> wrote: >> Hi Gustavo, >> >> here my 0.02$: >> >> * For some of your steps there is already code available in BioJava. >> MIght be good to take a look at what is already there... (look at >> the alignment and phylo modules for dynamic programming and >> Neighbour-Joining) >> >> * What about risks? Where do you expect difficulties and how to work >> around them? >> >> * Step 4: Can you add more details? How do you plan to approach this? >> E.g. Clustalw has a number of rules implemented at this stage. Do you >> plan to support multiple rules as well and how to do this technically. >> Something nice would be the possibility to use structure alignments to >> guide the sequence alignments. (structure module) > > Based on it I rewrote the step 4 and add a "Main Risks" section. > > I pasted just the new version of step 4 and the new section at the end > of this e-mal. > > Thank you very much for your feedback. > > gustavo > > > > ------------------------------------------------------------------------------------------- > > ** 4. Implement the algorithm for progressive MSA and the MSA wrapper. > > A progressive MSA is a heuristic approach for the MSA problem, at > each step a pairwise alignment between two sequences, a sequence and > an alignment or between two alignments is done. So, the multiple > alignment is built incrementally, at each iteration more sequences are > aligned together. The guide tree gives an order for this incremental > alignment, in a bottom-up (in the tree) fashion sequences (or groups > of sequences) with greater similarity are aligned first. Therefore, in > order to have a more flexible and reusable code, the code design will > allow any binary tree of the sequences to be used as a guide tree, not > only the one built in the last step. This will allow a priori > phylogenetic or tertiary similarity (structural similarity) knowledge > be used to guide the multiple alignment order. > > This is certainly the most difficult part of the project, so to make > sure we are going to deliver a fully functional MSA algorithm, a safer > approach is going to be taken. In the first place, a a basic algorithm > described in [2] will be implemented. Once this get successfully done > and the code fully integrated to the Biojava code base, the features > described in [1] are going to be incrementally added (and tested) in > order to implement the full algorithm. This step is further divided in > substeps. > > *** 4.1 Implement a first simpler dynamic programming (DP) algorithm. > > This is the generalized pairwise alignment used in each iteration of > the progressive MSA. Gaps already presents in one of the alignments > (profiles) remain fixed, gap opening penalties remain unchanged, this > means that opening new gaps inside existent gaps will be fully > penalized. The code for this algorithm is similar to, the already > present in Biojava, code for regular pairwise alignment. > > *** 4.2 Implement the basic progressive MSA algorithm. > > In this substep is going to be implemented the incremental algorithm > to built the MSA, transversing a guide tree (parameter, could be the > one built in step 3 or any other one) in a bottom-up fashion and using > the algorithm from substep 4.1 at each iteration. > > *** 4.3 Implement the MSA wrapper. > > The MSA wrapper is going to be a method that wraps steps 2, 3 and > 4.2, giving a simple method (for the final user) to calculate the MSA. > Receiving as parameters the set of sequences to be aligned, the gap > opening penalty, gap extend penalty and residue matrix. Returning the > MSA for the sequence set. > At the end of this substep, we get a basic fully functional MSA > algorithm, using the progressive heuristic. > > *** 4.4 Implement gaps penalties rescaling and parameter default values. > > Gap penalties to open a new gap an extend a existing one (the affine > gap weight model) are user defined parameters. This substep will > define default values, based on the residue matrix, for this > parameters and implement global rescaling rules (based on sequences > sizes) for this parameters. > > *** 4.5 Enhance the DP algorithm to use different sequences weight. > > Based on the guide tree, for each sequence a different weight > (divergent sequences receive high values) is calculated and used in > the scoring scheme of the generalized DP algorithm. > > *** 4.6 Enhance the DP algorithm to use position based gap penalties. > > The DP algorithm from substep 4.1 uses globally defined gap opening > penalty. In this substep, the algorithm is going to be modified do use > position based penalty, this is simple, once is known an array of > opening penalties for each sequence position. This array is calculated > based on several hierarchical (only apply the first one that fits, if > any) rules, those are rescaling rules and the array is initialized > with the original gap opening penalty. > > Given the hierarchical nature of the rules, they can be implemented in > a incremental way, from the highest priority rule to the lowest, the > algorithm of each step being a refinement of the previous one. I am > omitting the detailed description of each rule. However, to verify if > a given rule apply to a given position, all that is necessary is to > check at most 16 adjacent positions and the same position in the other > already aligned sequences. > > At the end of each of the following steps we a have functional > algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete. > > **** 4.6.1 Lowered gap opening penalties at existing gaps. > **** 4.6.2 Increased gap opening penalties near existing gaps. > **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches. > **** 4.6.4 Residue specific gap penalties. > > ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. > > EXTRA: Implement some benchmark technique to measure the final > alignment quality. > > Main Risks > ---------- > > The main risk to this project is the intrinsic complexity of the MSA > progressive algorithm. To deal with that we decided to break the > implementation in a large number of small and manageable steps, and > the steps are designed in a way that, at the end of each of them, we > will have a complete and testable new function (or a modification of > an existing one). Besides that, to be extra careful the project aims > to produce a simple full functional MSA algorithm as early as > possible, the estimated time is 8 weeks, this way we guarantee to > deliver at a simpler, but working and bug-free, version. > > > > >> Andreas >> >> >>> ------------------------------------------------------------- >>> >>> GSoC proposal >>> >>> Abstract >>> -------- >>> >>> This project aims to develop an all-Java implementation of a multiple >>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit, >>> using the progressive algorithm described in the CLUSTALW paper [1]. >>> >>> The Importance >>> -------------- >>> >>> Multiple sequence alignment is a frequently performed task in sequence >>> analysis with the goal to identify new members of protein families and >>> infer phylogenetic relationships between proteins and genes. At the >>> present there is no Java-only implementation for this algorithm. As >>> such the number of already existing and Java related BioInformatics >>> tools and web sites would benefit from this implementation and >>> sequence analysis could be more easily performed by the end-user. >>> >>> About Me >>> -------- >>> >>> I am a graduate student at University of São Paulo (Brazil), I got my >>> undergraduate degree from the same university with a major in Computer >>> Science and a minor in Biology. I have been involved with >>> Bioinformatics for 5 years, always with sequence analysis with >>> particular interest in the MSA problem. Also, in my undergraduate >>> final project I developed a lossless filter (pruning algorithm) for >>> the MSA problem, the work is published in [3] and there is an online >>> implementation of the algorithm in [4]. Finally, I have experience >>> with the C, C++, Java, Python and Ruby programming languages; Git and >>> SVN version control systems. >>> >>> Project Plan >>> ------------ >>> >>> The project is divided in four main steps, at the end of each step a >>> completely functional and bug-free new algorithm will be added to the >>> Biojava code base. It should be noticed that each step has a strong >>> dependence on the previous one, so before move to the next step a >>> careful testing will be done. >>> >>> The four steps are described below, estimated times for accomplishment >>> of each step are also given and in some steps extra enhancements are >>> described, they will be implemented if there is some time remaining >>> after all steps are completed. >>> >>> ** 1. Study the Biojava pairwise alignment code and update it to be >>> compliant with Biojava 3. >>> >>> The pairwise alignment will play an important role in the MSA >>> algorithm. This step is also important for me to get used to the >>> Biojava coding standards and get in touch with the Biojava dev >>> community. >>> >>> ETA: 2 weeks. >>> >>> ** 2. Implement the algorithm to build the distance matrix. >>> >>> This is done using the pairwise alignment for each pair of sequence >>> in the set to be aligned. >>> >>> ETA: 1 week. >>> >>> EXTRA: Enhance the basic algorithm to use parallel strategies, use >>> several threads to calculate the pairwise alignment for different >>> pairs in the sequence set. >>> >>> ** 3. Implement the algorithm to build the guide tree. >>> >>> The guide tree is based on the distance matrix built in the last >>> step, the tree construction strategy adopted will be the Neighbor >>> Joining Algorithm. >>> >>> ETA: 2 weeks. >>> >>> ** 4. Implement the algorithm for progressive MSA using the guide tree. >>> >>> This is certainly the most difficult part of the project, so to make >>> sure we are going to deliver a fully functional MSA algorithm, a safer >>> approach is going to be taken. In the first place, a dynamic >>> programming algorithm described in [2] will be implemented. Once this >>> get successfully done and the code fully integrated to the Biojava >>> code base, the features described in [1] are going to be incrementally >>> added (and tested) in order to implement the full dynamic programming >>> algorithm. >>> >>> ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. >>> >>> EXTRA: Implement some benchmark technique to measure the final >>> alignment quality. >>> >>> References >>> ---------- >>> >>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417 >>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435 >>> [3] http://www.almob.org/content/4/1/3 >>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu >>> >>> >>> >>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <[email protected]> wrote: >>>> Hi Gustavo, >>>> >>>> In principle I agree to all, see details below: >>>> >>>> >>>> I think my question wasn't very clear, my intention in this project is >>>>> >>>>> to follow the approach (with the tree steps) outlined in the project's >>>>> page. Using the classical progressive alignment heuristic: build the >>>>> distance matrix, build the guide tree and using this tree >>>>> progressively align more sequences together. >>>> >>>> yes >>>> >>>>> >>>>> What I propose for the third step is a first implementation using the >>>>> (more simple) dynamic programming described in the first CLUSTAL paper >>>>> (I thinks it's from 1988) and incrementally improving the algorithm to >>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this >>>>> more or less what you had in mind? >>>> >>>> yes, sounds good. >>>> >>>>> >>>>> About parallel strategies, I think a relative easy way we could use it >>>>> is in the distance matrix construction, we could have several threads >>>>> calculating the pairwise alignment for different pairs of sequence in >>>>> the set. >>>> >>>> Correct. Probably a first implementation would be for a single machine/ >>>> multi CPU. More advanced implementations could provide support e.g. for >>>> Map/Reduce, JPPF, or something like that... >>>> >>>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW >>>>> paper doesn't give any way to measure the quality of the result, they >>>>> consider a good alignment the one that is hard to improve by eye (But >>>>> they claim that for sequences sufficient similar, no pair less than >>>>> 35% identical, the results are good). Can I do the same as in CLUSTALW >>>>> paper and leave the quality measure to the user? How concerned should >>>>> I be with that in this project? >>>> >>>> Getting an overall core-algorithm that works should be priority. The >>>> benchmarking part is not mandatory, but something to keep in mind... I have >>>> plenty of material for that, once we get to that stage... >>>> >>>>> I will try send to this mailing list a proposal draft until tomorrow >>>>> to have some feedback from you. >>>> >>>> Excellent, looking forward to it. >>>> >>>> Andreas >>>> >>>> -- >>>> ----------------------------------------------------------------------- >>>> Dr. Andreas Prlic >>>> Senior Scientist, RCSB PDB Protein Data Bank >>>> University of California, San Diego >>>> (+1) 858.246.0526 >>>> ----------------------------------------------------------------------- >>>> >>> >> >> >> >> -- >> ----------------------------------------------------------------------- >> Dr. Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- _______________________________________________ Biojava-l mailing list - [email protected] http://lists.open-bio.org/mailman/listinfo/biojava-l
