Dear Sir, I am very interested in contributing to this project.
I am looking for a good problem, more on the research side. I can also help
with coding (I also work as a software engineer: J2EE/Eclipse/JBoss/Tomcat).
Anything that I could work on...

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger <[email protected]> wrote:
> Hi all,
>
> This e-mail is just for your information about somebody new, who'd like
> to contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject:
> Re: Fwd: Proposing a project on "Biojava alignment lead"
> From:
> Andreas Dräger <[email protected]>
> Date:
> Wed, 07 Apr 2010 09:27:13 +0200
> To:
> Cai Shaojiang <[email protected]>
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
> > I am a PhD student from National University of Singapore. My major
> > research area is local alignment algorithms and data structures for SNP
> > identification. And I have used Java and Eclipse for years for software
> > development. I am very interested in your GSoC programme. I find that
> > there is a module called "biojava-alignment lead" whose mentor is you. I
> > want to propose a new project on this module. I have several questions
> > about this module.
>
> Yes, that's me. So great to get your support.
>
> > 1. It seems that pairwise alignment is to find similarity between two
> > short sequences. Existing pairwise alignment is based on dynamic
> > programming, is it the Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of
> the latter approach should be in the cookbook.
>
> > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I did last week already, but it could still be
> improved.
> The problem was that the alignment algorithms actually
> produced a kind of string that looks similar to the output of BLAST.
> This string contained the score, the computation time, the length of the
> alignment etc. The problem was that people wanted to perform
> higher-level computation on the score value or evaluate some other
> information. Now, the alignment will produce a data structure that
> contains all the information and can, in addition to that, also produce
> such a BLAST-like output. There is, however, still the following
> problem: The data structure requires both sequences in the pairwise
> alignment to have an identical length. In the case of local alignment this
> is especially stupid (actually), because gaps are inserted to fill the
> sequences. And then the data structure tries to keep the old sequence
> coordinates, leading to the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift
> the sequences against each other when displaying the output. So, you
> cannot easily print the sequences below each other; you first have to
> shift them. Please check out the latest version of this package via
> anonymous svn and have a look ;-)
>
> > 3. My existing research area is aiming to deal with aligning short
> > reads (10s~100s bp) against extremely long sequences (e.g., human
> > genome). As far as I know, there are no such alignment tools
> > implemented in Java. Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste
> of memory), but improve the data structure. There is already an
> UnequalLengthAlignment (just a data structure, no algorithm) and I think
> we could use this as a starting point. Then your algorithm should only
> produce such a data structure and this would be fine.
>
> > 4. It seems that the existing tools are just lacking some
> > refactoring and representation interfaces. Any more underlying tasks?
>
> Hm. Yes: With the release of BioJava 3, data structures have changed
> again. So maybe there's also some adaptation to the new structure required.
>
> > I have been keeping an eye on GSoC since last month, but sorry to find out
> > that I sent the initial email to the mailing list before I subscribed to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list - [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l
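
For readers of the thread, the data structure discussed in the answers to
questions 2 and 3 could look roughly like the sketch below. It is only an
illustration under stated assumptions: the class LocalAlignmentResult and all
of its method names are hypothetical and are not taken from BioJava. The idea
is to store the score and the start coordinates as plain fields, keep only the
aligned region instead of padding the short sequence with gaps, and generate
the BLAST-like text on request.

```java
/**
 * Hypothetical sketch, NOT part of the BioJava API: a local alignment result
 * that keeps only the aligned region plus its coordinates, so a short read
 * aligned to a long subject is never padded with gap symbols.
 */
final class LocalAlignmentResult {
    private static final char GAP = '-';

    private final int score;
    private final int queryStart;        // 1-based position of the first aligned query residue
    private final int subjectStart;      // 1-based position of the first aligned subject residue
    private final String alignedQuery;   // aligned region only, '-' marks a gap
    private final String alignedSubject; // aligned region only, '-' marks a gap

    LocalAlignmentResult(int score, int queryStart, int subjectStart,
                         String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("Aligned rows must have equal length");
        }
        this.score = score;
        this.queryStart = queryStart;
        this.subjectStart = subjectStart;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
    }

    int getScore() { return score; }
    int getQueryStart() { return queryStart; }
    int getSubjectStart() { return subjectStart; }

    // End coordinates are derived by counting residues, so gaps do not shift them.
    int getQueryEnd() { return queryStart + countResidues(alignedQuery) - 1; }
    int getSubjectEnd() { return subjectStart + countResidues(alignedSubject) - 1; }

    private static int countResidues(String row) {
        int n = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) != GAP) {
                n++;
            }
        }
        return n;
    }

    /** Renders a BLAST-like text view; the two rows line up without any shifting. */
    String toBlastLikeString() {
        StringBuilder matches = new StringBuilder();
        for (int i = 0; i < alignedQuery.length(); i++) {
            char q = alignedQuery.charAt(i);
            matches.append(q != GAP && q == alignedSubject.charAt(i) ? '|' : ' ');
        }
        // Pad both start coordinates to the same width so the rows stay aligned.
        int width = Math.max(String.valueOf(queryStart).length(),
                             String.valueOf(subjectStart).length());
        String fmt = "%" + width + "s";
        return "Score: " + score + "\n"
             + "Query  " + String.format(fmt, queryStart) + " " + alignedQuery + " " + getQueryEnd() + "\n"
             + "       " + String.format(fmt, "") + " " + matches + "\n"
             + "Sbjct  " + String.format(fmt, subjectStart) + " " + alignedSubject + " " + getSubjectEnd() + "\n";
    }
}
```

With such an object, higher-level code can read getScore() or the coordinates
directly instead of parsing them out of a report string, and an alignment
algorithm for short reads against a genome would only need to return the
aligned region and its two start positions.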
