In fact, it is nowhere near data intensive. It does take gigabytes of input data to process, but it is mostly RAM and CPU intensive. Post-processing of the alignment files, though, is exactly where Hadoop excels. As far as I understand, the majority of the time is spent on DP (dynamic programming) alignment, whereas navigating the seed space and the N*log(N) sort require only a fraction of that time. That was my experience applying a Hadoop cluster to sequencing human genomes.
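To illustrate where the time goes, here is a minimal seed-and-extend sketch in Python. It is not CloudBurst's code or any production aligner: the function names (`build_seed_index`, `edit_distance`, `map_read`) and the parameters `k` and `max_diff` are made up for this example, and plain edit distance stands in for whatever DP scoring a real tool would use. The seed lookup is a single dictionary access per read, while each candidate position costs a full O(len(read) * len(window)) DP table.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def edit_distance(a, b):
    """Classic DP edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def map_read(read, reference, index, k, max_diff):
    """Seed with the read's first k-mer, then verify each hit with DP."""
    hits = []
    for pos in index.get(read[:k], []):           # cheap: one hash lookup
        window = reference[pos:pos + len(read)]
        d = edit_distance(read, window)           # expensive: quadratic DP
        if d <= max_diff:
            hits.append((pos, d))
    return sorted(hits)                           # the N*log(N) sort step


reference = "ACGTACGTTTGACCA"
index = build_seed_index(reference, k=4)
print(map_read("ACGTTTGA", reference, index, k=4, max_diff=1))
```

With repetitive seeds the number of candidate positions per read grows, and the DP verification is repeated for each one, which is why it tends to dominate over the lookup and the final sort.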
---
Dmitry Pushkarev
+1-650-644-8988

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Michael Schatz
Sent: Wednesday, April 08, 2009 9:19 PM
To: [email protected]
Subject: CloudBurst: Hadoop for DNA Sequence Analysis

Hadoop Users,

I just wanted to announce my Hadoop application 'CloudBurst' is available open source at:
http://cloudburst-bio.sourceforge.net

In a nutshell, it is an application for mapping millions of short DNA sequences to a reference genome to, for example, map out differences in one individual's genome compared to the reference genome. As you might imagine, this is a very data intense problem, but Hadoop enables the application to scale up linearly to large clusters.

A full description of the program is available in the journal Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236

I also wanted to take this opportunity to thank everyone on this mailing list. The discussions posted were essential for navigating the ins and outs of hadoop during the development of CloudBurst.

Thanks everyone!

Michael Schatz
http://www.cbcb.umd.edu/~mschatz
