Hi all,

This e-mail is just for your information: somebody new would like to contribute to our project.

Cheers
Andreas

--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091
--- Begin Message ---
Hi Cai Shaojiang,

Thank you for your e-mail! I don't know what happened to the e-mail list. Sometimes it takes a while due to the spam filters, I guess.

I am a PhD student at the National University of Singapore. My main research area is local alignment algorithms and data structures for SNP identification, and I have used Java and Eclipse for software development for years. I am very interested in your GSoC programme. I found a module called "biojava-alignment lead", for which you are the mentor. I want to propose a new project on this module, and I have several questions about it.

Yes, that's me. So great to get your support.

1. It seems that pairwise alignment finds similarity between two short sequences. The existing pairwise alignment is based on dynamic programming; is it the Smith-Waterman algorithm?

So, currently, BioJava contains three different alignment approaches. There are two deterministic algorithms, i.e., Smith-Waterman for local alignment and Needleman-Wunsch for global alignment. Third, there is the possibility to apply Hidden Markov Models for alignment. An example of the latter approach should be in the cookbook.
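
To make the difference between the two deterministic approaches concrete, here is a minimal sketch of both recurrences. It is not BioJava's implementation: the class name, the method names, and the flat match/mismatch scheme with a linear gap penalty are assumptions chosen for illustration, and only the score is computed (no traceback).

```java
// Illustrative sketch only: class and method names are hypothetical, not BioJava API.
final class PairwiseScoreSketch {

    /** Flat match/mismatch substitution score (an assumed, simplistic scheme). */
    static int substitution(char a, char b, int match, int mismatch) {
        return a == b ? match : mismatch;
    }

    /** Needleman-Wunsch: score of the best global alignment; gap is a negative per-symbol penalty. */
    static int globalScore(String q, String s, int match, int mismatch, int gap) {
        int[][] h = new int[q.length() + 1][s.length() + 1];
        // In a global alignment, leading gaps are penalised.
        for (int i = 1; i <= q.length(); i++) h[i][0] = i * gap;
        for (int j = 1; j <= s.length(); j++) h[0][j] = j * gap;
        for (int i = 1; i <= q.length(); i++) {
            for (int j = 1; j <= s.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + substitution(q.charAt(i - 1), s.charAt(j - 1), match, mismatch);
                h[i][j] = Math.max(diag, Math.max(h[i - 1][j] + gap, h[i][j - 1] + gap));
            }
        }
        return h[q.length()][s.length()];
    }

    /** Smith-Waterman: score of the best local alignment; gap is a negative per-symbol penalty. */
    static int localScore(String q, String s, int match, int mismatch, int gap) {
        int[][] h = new int[q.length() + 1][s.length() + 1]; // first row and column stay 0
        int best = 0;
        for (int i = 1; i <= q.length(); i++) {
            for (int j = 1; j <= s.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + substitution(q.charAt(i - 1), s.charAt(j - 1), match, mismatch);
                // A cell never drops below 0, so an alignment may start anywhere.
                int cell = Math.max(0, Math.max(diag, Math.max(h[i - 1][j] + gap, h[i][j - 1] + gap)));
                h[i][j] = cell;
                best = Math.max(best, cell); // the best local alignment may end anywhere
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("global: " + globalScore("ACGT", "AGT", 2, -1, -2));
        System.out.println("local:  " + localScore("TTACGTTT", "GGACGGG", 2, -1, -2));
    }
}
```

The only differences are the initialisation of the first row and column, the floor at 0, and where the final score is read from the matrix.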

2. What is the exact task of "refactoring of underlying data structures"?

Yes, this is something I did last week already, but it could still be improved. The problem was that the alignment algorithms actually produced a kind of string that looks similar to the output of BLAST. This string contained the score, the computation time, the length of the alignment, etc., but people wanted to perform higher-level computation on the score value or evaluate some other information. Now the alignment produces a data structure that contains all the information and can, in addition, also produce such a BLAST-like output. There is, however, still the following problem: the data structure requires both sequences in the pairwise alignment to have an identical length. In the case of local alignment this is especially stupid (actually), because gaps are inserted to fill the sequences. The data structure then tries to keep the old sequence coordinates, so the numbers "query start", "query end", "subject start", and "subject end" are required to shift the sequences against each other when displaying the output. So you cannot easily print the sequences below each other; you first have to shift them. Please check out the latest version of this package via anonymous svn and have a look ;-)
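
The idea of a result object that keeps the values accessible and only renders the BLAST-like text on request could look roughly like the sketch below. This is not the class in the trunk: the class name, the field names, and the exact output layout are assumptions made for illustration. It also shows the equal-length requirement on the two gapped rows.

```java
// Illustrative sketch only: a hypothetical result object, not the actual BioJava class.
final class AlignmentResultSketch {
    final int score;
    final long timeMillis;
    final String alignedQuery;    // gapped row, '-' marks a gap
    final String alignedSubject;  // gapped row of the same length
    final int queryStart, queryEnd, subjectStart, subjectEnd; // 1-based, inclusive

    AlignmentResultSketch(int score, long timeMillis, String alignedQuery, String alignedSubject,
                          int queryStart, int queryEnd, int subjectStart, int subjectEnd) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("gapped rows must have identical length");
        }
        this.score = score;
        this.timeMillis = timeMillis;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
        this.queryStart = queryStart;
        this.queryEnd = queryEnd;
        this.subjectStart = subjectStart;
        this.subjectEnd = subjectEnd;
    }

    /** Number of alignment columns, gaps included. */
    int length() {
        return alignedQuery.length();
    }

    /** Number of columns in which both rows carry the same non-gap symbol. */
    int identities() {
        int n = 0;
        for (int i = 0; i < length(); i++) {
            char c = alignedQuery.charAt(i);
            if (c != '-' && c == alignedSubject.charAt(i)) n++;
        }
        return n;
    }

    /** Renders a BLAST-like text block; callers who need numbers use the fields instead. */
    String toBlastLikeString() {
        StringBuilder match = new StringBuilder();
        for (int i = 0; i < length(); i++) {
            char c = alignedQuery.charAt(i);
            match.append(c != '-' && c == alignedSubject.charAt(i) ? '|' : ' ');
        }
        String indent = String.format("%12s", ""); // width of "Query: " plus the start column
        return "Score: " + score + ", Time: " + timeMillis + " ms, Length: " + length() + "\n"
                + String.format("Query: %4d %s %d\n", queryStart, alignedQuery, queryEnd)
                + indent + match + "\n"
                + String.format("Sbjct: %4d %s %d\n", subjectStart, alignedSubject, subjectEnd);
    }

    public static void main(String[] args) {
        AlignmentResultSketch r = new AlignmentResultSketch(4, 1L, "ACGT", "A-GT", 1, 4, 1, 3);
        System.out.print(r.toBlastLikeString());
    }
}
```

With such an object, code that wants to rank hits reads `score` directly and never has to parse the text output.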

3. My current research deals with aligning short reads (tens to hundreds of bp) against extremely long sequences (e.g., the human genome). As far as I know, no such alignment tool has been implemented in Java. Would you consider this direction?

See, this would be very nice to include. But this requires that we no longer fill the short sequence with many, many gap symbols (just a waste of memory), but improve the data structure. There is already an UnequalLengthAlignment (just a data structure, no algorithm), and I think we could use this as a starting point. Then your algorithm should only produce such a data structure and this would be fine.
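
To illustrate why padding is unnecessary, here is a minimal sketch of a placement that stores only the read and its offset into the long sequence. It does not reproduce the API of the existing unequal-length data structure; all names are hypothetical, and the placement is ungapped to keep the example short.

```java
// Illustrative sketch only: hypothetical names, not the API of any existing BioJava class.
final class ShortReadPlacementSketch {
    private final CharSequence reference; // shared long sequence, never copied or padded
    private final String read;            // the short read, stored once
    private final int offset;             // 0-based position of the read's first symbol

    ShortReadPlacementSketch(CharSequence reference, String read, int offset) {
        if (offset < 0 || offset + read.length() > reference.length()) {
            throw new IllegalArgumentException("read does not fit into the reference at offset " + offset);
        }
        this.reference = reference;
        this.read = read;
        this.offset = offset;
    }

    /** 1-based, inclusive coordinates on the reference. */
    int subjectStart() { return offset + 1; }
    int subjectEnd()   { return offset + read.length(); }

    /** The stretch of the reference that lies under the read. */
    String window() {
        return reference.subSequence(offset, offset + read.length()).toString();
    }

    /** Number of positions at which read and reference differ. */
    int mismatches() {
        int n = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) != reference.charAt(offset + i)) n++;
        }
        return n;
    }

    /** Symbols kept for the read row here, versus a row padded to the reference length. */
    int storedSymbols() { return read.length(); }
    int paddedSymbols() { return reference.length(); }

    public static void main(String[] args) {
        ShortReadPlacementSketch p = new ShortReadPlacementSketch("GGGGACGTGGGG", "ACTT", 4);
        System.out.println(p.window() + " " + p.subjectStart() + ".." + p.subjectEnd()
                + " mismatches=" + p.mismatches());
    }
}
```

For a read of about 100 bp against a whole genome, the difference between `storedSymbols()` and `paddedSymbols()` is exactly the memory that gap padding would waste per alignment.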

4. It seems that the existing tools just lack some refactoring and representation interfaces. Are there any more underlying tasks?

Hm. Yes: With the release of BioJava 3, the data structures have changed again. So maybe some adaptation to the new structure is also required.

I have been keeping an eye on GSoC since last month, but I am sorry to find out that I sent the initial e-mail to the mailing list before I subscribed to it...

Ok. Sounds good. Thanks for your interest. So I suggest: Download the latest trunk, have a look, play around and if you can improve something we'll put it into the trunk and write your name into the authors' tag.

Cheers
Andreas

--- End Message ---
_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
