Dear Michael

You raise a very good question. The amount of microarray data is ever
increasing, and dealing with 10,000 .cel files is quite challenging.

R and Bioconductor are great for developing and testing novel
algorithms; however, personally, I do not think that R will ever
be able to deal with massive amounts of data. 10,000 .cel files using
the newest GeneChips are equivalent to more than 200 Gigabytes of data,
so we are eventually talking about data in the Terabyte range.
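
As a rough check of these figures, here is a minimal sketch in Python. It assumes decimal units and takes "200 Gigabyte" as a lower bound; the helper names are purely illustrative and not part of any package.

```python
# Back-of-envelope check of the figures quoted above.
# Assumptions: decimal units (1 GB = 10**9 bytes), 200 GB used as a lower bound.
GB = 10**9
TB = 10**12

total_bytes = 200 * GB   # "more than 200 Gigabyte" from the message
n_files = 10_000         # number of .cel files from the message


def bytes_per_file(total, count):
    """Average size per file implied by a total volume and a file count."""
    return total / count


def files_to_reach(target, per_file):
    """Number of files of a given average size needed to reach a target volume."""
    return target / per_file


per_file = bytes_per_file(total_bytes, n_files)
print(f"implied size per .cel file: {per_file / 10**6:.0f} MB")
print(f"files needed to reach 1 TB: {files_to_reach(TB, per_file):,.0f}")
```

Under these assumptions each file averages at least about 20 MB, so roughly five times as many files would already reach one Terabyte.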

Maybe it is time to look at how scientists who are used to handling
large data, such as the high energy physicists, deal with this problem.
Having done this, I have decided to start writing my own expression
analysis program, which is no longer based on R but on C++, using a
framework called ROOT, which is currently under development at CERN to
deal with Petabytes (!!) of data, see:
http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/Stratowa.pdf

Unfortunately, it is taking me longer than expected to develop this
software, but you are looking ahead two or three years anyhow :-)

If microarray data were stored in the way described, i.e. in the
same way as high energy physics data, this would already be a step
in the right direction.

However, this is only my personal opinion. In our company I still
mainly use R to analyse our microarray data.

Best regards
Christian Stratowa
Vienna, Austria


Michael Benjamin wrote:
Hi--

While I agree that we cannot agree on the ideal algorithms, we should be
taking practical steps to implement microarrays in the clinic.  I think
we can all agree that our algorithms have some degree of efficacy over
and above conventional diagnostic techniques.  If patients are dying
from lack of diagnostic accuracy, I think we have to work hard to use
this technology to help them, if we can.  I think we can, even now.

What if I offer, in my clinic, a service for cancer patients to compare
their affy data to an existing set of data, to predict their prognosis
or response to chemotherapy?  I think people will line up out the door
for such a service.  Knowing what we as a group of array analyzers know,
wouldn't we all want this kind of service available if we or a loved one
got cancer?

Can our programs deal with 1,000 .cel files? 10,000 files?

I think our programs are pretty good, but what we need is DATA.  We must
be careful what we wish for--we might get it!  So how do we measure
whether analyzing 10,000 .cel files with library(affy) is feasible?  I'm
assuming that advanced hardware would be required for such a task.  What
are the critical components of such a platform?  How much money would a
feasible system for array analysis cost?

I was just looking ahead two or three years--where is all this genomic
array research headed? I guess I'm concerned about scalability.


Is anyone really working on implementing affy on a cluster/Beowulf?
That sounds like a real challenge.

Regards,
Michael Benjamin, MD
-----Original Message-----
From: Liaw, Andy [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 9:47 PM
To: 'Michael Benjamin'
Subject: RE: [BioC] R performance questions


Another point about benchmarking:  As has been discussed on R-help
before, benchmarks such as the one you mentioned can be misleading.  It
measures linear algebra tasks, etc., but those typically account for a
very small portion of "average" tasks.  Doug Bates also pointed out that
the eigen() example used in that benchmark is computing mostly
meaningless results.

In our experience, learning to use R more efficiently gives us the most
mileage, but large and fast hardware wouldn't hurt...

Cheers,
Andy


-----Original Message-----
From: Michael Benjamin [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 7:32 PM
To: 'Liaw, Andy'
Subject: RE: [BioC] R performance questions



Thanks. Mike

-----Original Message-----
From: Liaw, Andy [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2003 8:17 AM
To: 'Michael Benjamin'
Subject: RE: [BioC] R performance questions


Hi Michael,

Just one comment about SVM. If you use the svm() function in the e1071
package to train a linear SVM, it will be rather slow. That's a known
limitation of libsvm, which the svm() function uses. If you are willing
to go outside of R, the "bsvm" package by C.J. Lin (the same person who
wrote libsvm) will train a linear SVM in a much more efficient manner.


HTH,
Andy


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Benjamin
Sent: Tuesday, December 02, 2003 10:30 PM
To: [EMAIL PROTECTED]
Subject: [BioC] R performance questions



Hi, all--


I wanted to start a thread on R speed/benchmarking. There is a nice R
benchmarking overview at http://www.sciviews.org/other/benchmark.htm,
along with a free script so you can see how your machine stacks up.

Looks like R is substantially faster than S-plus.

My problem is this: with 512 MB and an overclocked AMD Athlon XP 1800+,
running at 588 SPEC-FP 2000, it still takes FOREVER to analyze multiple
.cel files using affy (expresso). Running svm takes a mighty long time
with more than 500 genes, 150 samples.


Questions:
1) Would adding RAM or processing speed improve performance the most?
2) Is it possible to run R on a cluster without rewriting my high-level
code? In other words,
3) What are we going to do when we start collecting terabytes of array
data to analyze? There will come a "breaking point" at which desktop
systems can't perform these analyses fast enough for large quantities of
data. What then?


Michael Benjamin, MD
Winship Cancer Institute
Emory University,
Atlanta, GA

_______________________________________________
Bioconductor mailing list
[EMAIL PROTECTED]
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor






______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


