FYI....
NCI Contracts ISB, Google, SRA Team to Develop Infrastructure for Cancer
Genomics Cloud Pilots
By Uduak Grace Thomas <[email protected]>

*This is the first of three stories looking at the Cancer Genomics Cloud
pilot proposals selected by the NCI.*

NEW YORK (GenomeWeb) – Researchers at the Institute for Systems Biology and
their partners from Google and SRA International have received a roughly
$6.5 million contract from the National Cancer Institute to develop one of
three sets of infrastructure for the Cancer Genomics Cloud pilots, an NCI
initiative to build sustainable computing infrastructure for accessing and
analyzing genomic and related data from its funded research projects.

The ISB-led team, which is proposing a system based on Google's cloud, is
one of three groups that were selected to receive cost reimbursement
contracts to develop cost-effective, sustainable cloud-based compute and
storage systems that address the limitations of current infrastructure used
to manage and analyze data from large-scale NCI-funded projects.

In addition to the contract awarded to the ISB-led team, a proposal from
the Broad Institute partnering with the University of California, Santa
Cruz was accepted and awarded over $7 million to develop their planned
system. The third contract went to Seven Bridges Genomics, whose funding is
just shy of $5.9 million. These amounts include base costs and all options,
according to the grant announcement. Also, the final allocations were made
based on cost estimates provided by the respective applicants in their
proposals.

For their part, ISB is proposing a system built on the Google cloud
infrastructure that will offer both programmatic and web-based access to
the data, Ilya Shmulevich, an ISB professor and principal investigator on
the NCI contract, told *BioInform*. This project extends an existing
relationship between the ISB and Google that dates back to 2012, when Google
first rolled out its Google Compute Engine. Google tapped the Shmulevich group
<http://www.genomeweb.com/informatics/google-works-isb-evaluate-life-sciences-application-area-new-cloud-infrastructur>
to evaluate the infrastructure's ability to handle life science computing
requirements, and the group adapted software it had developed to analyze TCGA
data to run on the newly minted system.

For this new endeavor, the partners plan a platform based on Google's cloud
that leverages Google Genomics' application programming interface
<https://cloud.google.com/genomics/v1beta/reference/> (which also
includes an implementation of the Genomics API
<http://www.genomeweb.com/informatics/ga4gh-data-working-group-plans-new-modules-recently-updated-api-continues-suppor>
developed by the Global Alliance for Genomics and Health's data working group) for
storing, processing, querying, exploring, and sharing data. It will also
have a tractable web interface through which less informatics-savvy
researchers can interact with and explore the data, he said. Researchers
will also use scalable, reliable virtual machines and storage
infrastructure provided by Google as well as its familiar collaboration
resources and services including Google Docs and Google Hangouts.

The third partner in this triad, SRA International, will contribute
security, testing, and documentation to the pilot, Shmulevich said. SRA's
expertise in these areas is gleaned at least in part from its years of
working on multiple federally funded projects including The Cancer Genome
Atlas. In an email to *BioInform*, SRA's Senior Director of Bioinformatics,
John Greene, said that his firm "looks forward to using our deep knowledge
of the TCGA data acquired over the last seven years of running the Data
Coordinating Center for that project to help Ilya's strong team …
demonstrate the increasing value of doing large-scale biological data
computations on a public cloud platform." David Pot, SRA's director of
bioinformatics, has already begun "assembling our part of the team to deal
with security and testing," he added.

Furthermore, researchers will also be able to upload their own private
datasets and explore them in the context of the larger public information
that will be available in the cloud. ISB and its collaborators intend to
include not just the TCGA core data in their infrastructure but also all
the orthogonal data types including gene expression and clinical data as
well as data from the 1000 Genomes project and the GlaxoSmithKline
cancer-cell-line data set, Shmulevich said.

For the purpose of the pilots, participants are only required to show that
their systems can handle the TCGA's 2.5 petabytes of data plus one
orthogonal data type, but ISB and its partners intend to provide a much
richer and more comprehensive resource for the community to try out.
Google's cloud is certainly capable of handling that much data and more,
and ISB routinely processes large quantities of data of different kinds in
its capacity as one of the TCGA's Genome Data Analysis Centers, Shmulevich
noted, so the partners are well equipped to design a system that meets these
requirements and scales as needed.

The NCI's Board of Scientific Advisors and the National Cancer Advisory
Board first approved the Cancer Genomics Cloud pilots in June 2013 following
a detailed presentation
<http://www.genomeweb.com/informatics/nci-board-approves-proposed-cancer-genomics-cloud-pilots>
of the concept by George Komatsoulis, who at the time was
NCI's chief information officer and interim director of its Center for
Biomedical Informatics and Information Technology. Anticipating petabytes
of data from the TCGA and similar projects and responding to data access
and use barriers such as limited local compute and protracted download
times, the agency set out to create a communal resource that addresses
these issues by providing co-located computational capacity and storage as
well as APIs that connect software, data, and compute resources.

In January this year, the NCI issued a broad agency announcement
<http://www.genomeweb.com/informatics/nci-begins-accepting-proposals-cancer-genomics-cloud-pilots>
(BAA) that laid out in more detail the pilots' research and
technical objectives, architecture and eligibility requirements, and
proposal expectations including budgetary requirements. The institute also
hosted a conference call and webcast that allowed members of the academic
and commercial communities to give feedback on the document and ask
questions. The release of the BAA officially launched the six-week
proposal-collection phase for the pilots, and at the time, the NCI
estimated that it would spend approximately $20 million on the three or
more contracts.

In response to the BAA, the NCI received "many strong proposals," Anthony
Kerlavage, branch chief of informatics programs at the NCI's Center for
Biomedical Informatics and Information Technology, told *BioInform* last
week, but the three that it ultimately selected were "superior to the
others in our technical evaluation and in our opinion of the ones that
would bring the best value to the NCI."

The authors of the winning proposals now have six months to complete their
initial designs and begin developing their platforms. After that, they'll
move through two consecutive nine-month option periods. During the first, they have
to complete and implement their systems. By the time the second one gets
underway, the systems should be fully operational and ready to be evaluated
by the NCI and the broader cancer research community. In total,
development, testing, and evaluation of the clouds should take about 24
months.

Because each proposal adopts a unique infrastructure development approach
and leverages different solutions, these pilots provide a golden
opportunity to assess multiple alternatives in tandem and to see which
single system or which combination of systems best democratizes access to
NCI's datasets and is sustainable in the long run, Kerlavage said. In
addition, there's an opportunity to start exploring mechanisms for making
related NCI informatics initiatives interoperable, for example, linking the
Cancer Genomics Cloud pilots to the Genomic Data Commons (GDC), he noted. The
grant for developing the GDC was awarded to the University of Chicago in
May this year. Interoperable infrastructure is also a major part of the
NCI's informatics strategy, so as part of this process, all four teams may
contribute to efforts to define common application programming interfaces
that link data and infrastructure across sites, he said.

Shmulevich expressed similar sentiments about working with the other
awardees on the interoperability issue and added that his team also intends
to engage the broader cancer research community in its efforts. To
encourage participation during the evaluation phase of the pilots,
the ISB team will give community members free cloud credits for compute and
storage, allocated through a quota system, that they can use to put the system
through its paces. "That's the real test [of whether] this is going to be
useful infrastructure," he said.
