You raise a very good question. The volume of microarray data is ever increasing, and dealing with 10,000 .cel files is quite challenging.
R and Bioconductor are great for developing and testing novel algorithms. Personally, however, I do not think that R will ever be able to deal with massive amounts of data. 10,000 .cel files from the newest GeneChips amount to more than 200 gigabytes of data, so we are eventually talking about data in the terabyte range.
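As a rough back-of-the-envelope check of these figures, the sketch below (in Python, chosen only for illustration) derives the average file size and the number of files per terabyte. The 200 GB figure is the lower bound quoted above, and the 512 MB of RAM comes from the original post in this thread. Treating a file's size on disk as its memory footprint is an assumption made here for simplicity, not a property of any particular package.

```python
# Figures quoted in this thread; 200 GB is stated as a lower bound.
N_FILES = 10_000
TOTAL_GB = 200
DESKTOP_RAM_MB = 512  # RAM of the desktop described in the original post


def per_file_mb(total_gb, n_files):
    """Average size of one .cel file in MB (decimal units: 1 GB = 1000 MB)."""
    return total_gb * 1000 / n_files


def files_per_terabyte(total_gb, n_files):
    """How many such files add up to 1 TB (1 TB = 1,000,000 MB)."""
    return 1_000_000 / per_file_mb(total_gb, n_files)


def files_fitting_in_ram(ram_mb, total_gb, n_files):
    """Whole files that fit in RAM, ASSUMING in-memory size equals file size."""
    return int(ram_mb // per_file_mb(total_gb, n_files))


print(per_file_mb(TOTAL_GB, N_FILES), "MB per file")
print(files_per_terabyte(TOTAL_GB, N_FILES), "files per terabyte")
print(files_fitting_in_ram(DESKTOP_RAM_MB, TOTAL_GB, N_FILES), "files fit in RAM")
```

Under these assumptions a file averages about 20 MB, a terabyte holds about 50,000 files, and a 512 MB desktop could hold roughly 25 raw files at once, before counting any working memory the analysis itself needs.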
Maybe it is time to look at how scientists who are used to handling large data sets, such as high energy physicists, deal with this problem. Having done this, I have decided to start writing my own expression analysis program, which is no longer based on R but on C++, using a framework called ROOT, which is currently under development at CERN to deal with petabytes (!!) of data; see: http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/Stratowa.pdf
Unfortunately, it is taking me longer than expected to develop this software, but you are looking ahead two or three years anyhow :-)
If microarray data were stored in the way described, i.e. in the same way as high energy physics data, this would already be a step in the right direction.
However, this is only my personal opinion. In our company I still mainly use R to analyse our microarray data.
Best regards
Christian Stratowa
Vienna, Austria
Michael Benjamin wrote:
Hi--
While I agree that we cannot agree on the ideal algorithms, we should be taking practical steps to implement microarrays in the clinic. I think we can all agree that our algorithms have some degree of efficacy over and above conventional diagnostic techniques. If patients are dying from lack of diagnostic accuracy, I think we have to work hard to use this technology to help them, if we can. I think we can, even now.
What if I offer, in my clinic, a service for cancer patients to compare their affy data to an existing set of data, to predict their prognosis or response to chemotherapy? I think people will line up out the door for such a service. Knowing what we as a group of array analyzers know, wouldn't we all want this kind of service available if we or a loved one got cancer?
Can our programs deal with 1,000 .cel files? 10,000 files?
I think our programs are pretty good, but what we need is DATA. We must be careful what we wish for--we might get it! So how do we measure whether analyzing 10,000 .cel files with library(affy) is feasible? I'm assuming that advanced hardware would be required for such a task. What are the critical components of such a platform? How much money would a feasible system for array analysis cost?
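One way to approach the feasibility question is to time the analysis on a few small subsets and extrapolate. The sketch below (in Python, for illustration only) fits a power law, seconds = a * n^b, to such timings on log-log axes. The function `fake_analysis` is a hypothetical stand-in for the real pipeline, and the subset sizes are arbitrary choices; none of this reflects how library(affy) itself behaves.

```python
import math
import time


def time_run(analyze, n_files):
    """Wall-clock seconds taken by analyze(n_files)."""
    start = time.perf_counter()
    analyze(n_files)
    return time.perf_counter() - start


def fit_power_law(sizes, seconds):
    """Least-squares fit of seconds = a * size**b on log-log axes; returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_seconds(a, b, size):
    """Extrapolated run time for `size` files under the fitted power law."""
    return a * size ** b


def fake_analysis(n_files):
    """Hypothetical placeholder workload whose cost grows with n_files."""
    total = 0
    for i in range(n_files * 2000):
        total += i * i
    return total


if __name__ == "__main__":
    sizes = [10, 20, 40, 80]  # arbitrary subset sizes for the trial runs
    seconds = [time_run(fake_analysis, n) for n in sizes]
    a, b = fit_power_law(sizes, seconds)
    print(f"fitted exponent b = {b:.2f}")
    print(f"predicted time for 10,000 files: {predict_seconds(a, b, 10_000):.1f} s")
```

An extrapolation like this is best read as a lower bound: it only describes the regime in which the trial runs fit in memory, and it says nothing about what happens once the data no longer fit in RAM.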
I was just looking ahead two or three years--where is all this genomic array research headed? I guess I'm concerned about scalability.
Is anyone really working on implementing affy on a cluster/Beowulf? That sounds like a real challenge.
Regards,
Michael Benjamin, MD
-----Original Message-----
From: Liaw, Andy [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 9:47 PM
To: 'Michael Benjamin'
Subject: RE: [BioC] R performance questions
Another point about benchmarking: As has been discussed on R-help before, benchmarks can be misleading, and the one you mentioned is an example. It measures linear algebra tasks, etc., but those typically account for a very small portion of "average" tasks. Doug Bates also pointed out that the eigen() example used in that benchmark is computing mostly meaningless results.
In our experience, learning to use R more efficiently gives us the most mileage, but large and fast hardware wouldn't hurt...
Cheers, Andy
-----Original Message-----
From: Michael Benjamin [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 7:32 PM
To: 'Liaw, Andy'
Subject: RE: [BioC] R performance questions
Thanks. Mike
-----Original Message-----
From: Liaw, Andy [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 8:17 AM
To: 'Michael Benjamin'
Subject: RE: [BioC] R performance questions
Hi Michael,
Just one comment about SVM. If you use the svm() function in the e1071 package to train a linear SVM, it will be rather slow. That's a known limitation of libsvm, which the svm() function uses. If you are willing to go outside of R, the "bsvm" package by C.J. Lin (the same person who wrote libsvm) will train a linear SVM in a much more efficient manner.
HTH, Andy
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Benjamin
Sent: Tuesday, December 02, 2003 10:30 PM
To: [EMAIL PROTECTED]
Subject: [BioC] R performance questions
Hi, all--
I wanted to start a thread on R speed/benchmarking. There is a nice R benchmarking overview at http://www.sciviews.org/other/benchmark.htm, along with a free script so you can see how your machine stacks up.
Looks like R is substantially faster than S-plus.
My problem is this: with 512 MB of RAM and an overclocked AMD Athlon XP 1800+, running at 588 SPEC-FP 2000, it still takes FOREVER to analyze multiple .cel files using affy (expresso). Running svm takes a mighty long time with more than 500 genes and 150 samples.
Questions:
1) Would adding RAM or processing speed improve performance the most?
2) Is it possible to run R on a cluster without rewriting my high-level code? In other words,
3) What are we going to do when we start collecting terabytes of array data to analyze? There will come a "breaking point" at which desktop systems can't perform these analyses fast enough for large quantities of data. What then?
Michael Benjamin, MD
Winship Cancer Institute
Emory University, Atlanta, GA
_______________________________________________ Bioconductor mailing list [EMAIL PROTECTED] https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
______________________________________________ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
