Hello BioJava again,

After giving some thought to how cloud computing could be applied to BioJava modules, I have identified some possibilities:

1) The first one, and the one I find most interesting, is to introduce the MapReduce framework to speed up the pairwise alignment step in the creation of a multiple sequence alignment. I see that BioJava implements the CLUSTAL algorithm, and I have some experience with MSA programs; the pairwise alignment is known to be the most demanding part of this algorithm as the number of sequences increases. This MapReduce version of all-to-all sequence alignment could also be reused in the future if other progressive alignment algorithms are implemented (maybe T-Coffee or others).
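To make idea 1 a bit more concrete, here is a minimal sketch of how the all-to-all step could be split into independent tasks. It is only an illustration under my own assumptions: it uses neither the Hadoop API nor BioJava's aligners, a parallel stream stands in for the map phase, the class and method names (`AllPairsSketch`, `pairs`, `score`, `scoreAll`) are hypothetical, and the scoring scheme (match +1, mismatch -1, gap -1) is just a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: not BioJava or Hadoop code.
class AllPairsSketch {

    // Enumerate every unordered pair (i, j) with i < j.
    // For n sequences this yields n*(n-1)/2 independent tasks.
    static List<int[]> pairs(int n) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                out.add(new int[] {i, j});
            }
        }
        return out;
    }

    // Placeholder per-pair work: global alignment score
    // (Needleman-Wunsch with match +1, mismatch -1, gap -1).
    static int score(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = -j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = -i;
            for (int j = 1; j <= b.length(); j++) {
                int diag = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 1 : -1);
                cur[j] = Math.max(diag, Math.max(prev[j] - 1, cur[j - 1] - 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // "Map": score each pair independently (here in parallel threads).
    // "Reduce": gather the results into a symmetric score matrix.
    // The diagonal is left at 0 because self-pairs are never scored.
    static int[][] scoreAll(List<String> seqs) {
        List<int[]> scored = pairs(seqs.size()).parallelStream()
            .map(p -> new int[] {p[0], p[1], score(seqs.get(p[0]), seqs.get(p[1]))})
            .collect(Collectors.toList());
        int[][] m = new int[seqs.size()][seqs.size()];
        for (int[] s : scored) {
            m[s[0]][s[1]] = s[2];
            m[s[1]][s[0]] = s[2];
        }
        return m;
    }

    public static void main(String[] args) {
        int[][] m = scoreAll(List.of("ACGT", "ACGA", "TTGT"));
        for (int[] row : m) System.out.println(Arrays.toString(row));
    }
}
```

In a real Hadoop job, each pair (or block of pairs) would presumably become one map input record and the reducer would assemble the matrix that the guide-tree step needs; how best to partition the pairs across nodes is something I would still have to work out.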
2) If the input files are big enough, it could be worthwhile to parse them on a distributed infrastructure to speed up the process. In this case, the MapReduce framework would parallelize the parsing by splitting the input file into several chunks and parsing the sequences contained in each chunk.

3) Another idea is a Hadoop-based ("hadoopified") version of BLAST, in which the input file is also split and, for each sequence in a chunk, the node performs a local BLAST query. Since BioJava doesn't yet implement BLAST (which I see is another GSoC project), this idea would require writing a wrapper that executes the NCBI BLAST program and then joins the results.

Thanks in advance for your feedback, which I am hoping to receive before submitting my proposal.

Best regards!

On Fri, Mar 30, 2012 at 6:35 PM, Arthur Oviedo <[email protected]> wrote:

> Hello,
> My name is Arthur, and I'm a master's student in computer science at EPFL
> (École Polytechnique Fédérale de Lausanne).
> I have worked on different projects that are somewhat related to BioJava and
> cloud environments.
> While I was a research assistant, I worked (briefly) on a project
> called UnaCloud (
> http://sistemas.uniandes.edu.co/~unacloud/dokuwiki/doku.php?id=recursos:documentacion)
> which provides an opportunistic grid/cloud infrastructure for running
> scientific experiments, and we have used it to help bioinformaticians with
> their different jobs, like huge BLAST queries, HMMER jobs, etc.
> As part of my assistant work at the same university, I developed a
> system called UnaCloud MSA, which integrates some existing and newly developed
> tools to analyze multiple sequence alignments. It even uses the BioJava
> library to perform some verification of the sequences. All of this is
> also done using the UnaCloud infrastructure. This work is still in
> development and in preparation for publication.
> http://unacloudmsa.uniandes.edu.co
> Currently, I'm working on a class project (an implementation of a
> subset of the functionality of a database management system) using the
> Hadoop (MapReduce) framework.
> All of the mentioned projects have been implemented in Java, so I suppose
> that I meet the Java expertise requirement.
> I would like to know more about this project, and also the rough
> dates when the Google Summer of Code will be held (to prepare my
> schedule).
> Thanks and best regards,
> Arthur Oviedo
>
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
