I think a better way to characterise big data is not what the data is, how 
large it is, or how it’s structured, but how you ask questions of it and the 
rationale for which it was gathered.

In the traditional scientific model, you formulate a hypothesis and design an 
experiment to test it.  Your experiment generates some data, which 
could be tiny or could be ridiculously huge.  But nevertheless, the data was 
gathered specifically to answer that question, and is probably not terribly 
useful for anything else.

Big data analysis, by contrast, works differently: you gather a large 
amount of data without any particular hypothesis in mind, or you pick some 
dataset that was gathered for some other purpose, and you simply look for ways 
the data can be clustered or organised, and then try to determine whether that 
tells you something interesting.

Using my own field as an example, genomics is definitely moving in that 
direction.  When I started looking for genetic variation associated with 
disease 20 years ago, sequencing was very expensive, so we used the typical 
hypothesis-driven model: we had a set of candidate genes plausibly involved 
in cancer, diabetes, or whatever our condition of interest was.  We looked for 
variations specifically in those genes, and determined whether they were 
associated with the condition.

Now, we use a much more “big data” approach.  We perform whole genome 
sequencing of thousands of individuals, without any hypothesis as to what might 
or might not be involved, and we let statistical analysis show us where the 
associations are.  What’s more, once a genome’s been sequenced for one project, 
it’s equally useful for any other association study that might be of interest 
(ethical and consent issues notwithstanding).
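
To make the hypothesis-free approach concrete, here is a minimal sketch of 
such a scan.  Everything in it is an assumption for illustration: the data are 
synthetic, the names (chi2_p_value, scan, the choice of variant 7 as "causal") 
are made up, and a plain 2x2 chi-square test with a Bonferroni threshold 
stands in for whatever statistics a real study would use.

```python
import math
import random

def chi2_p_value(a, b, c, d):
    """p-value of a 2x2 chi-square test (1 degree of freedom, no correction).

    a, b = carriers among cases, controls
    c, d = non-carriers among cases, controls
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # a variant with an empty margin carries no information
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def scan(genotypes, is_case, alpha=0.05):
    """Test every variant, with no prior candidate list.

    genotypes: one list per variant, holding a carrier flag per individual.
    is_case:   one flag per individual (True = has the condition).
    Returns (variant_index, p_value) pairs passing a Bonferroni threshold.
    """
    threshold = alpha / len(genotypes)
    hits = []
    for idx, carriers in enumerate(genotypes):
        a = sum(1 for g, s in zip(carriers, is_case) if g and s)
        b = sum(1 for g, s in zip(carriers, is_case) if g and not s)
        c = sum(1 for g, s in zip(carriers, is_case) if not g and s)
        d = sum(1 for g, s in zip(carriers, is_case) if not g and not s)
        p = chi2_p_value(a, b, c, d)
        if p < threshold:
            hits.append((idx, p))
    return hits

# Synthetic cohort (hypothetical numbers): 500 cases, 500 controls, 200 variants.
# Only variant 7 is made more common among cases; the scan is not told this.
rng = random.Random(0)
n_people, n_variants, causal = 1000, 200, 7
is_case = [i < n_people // 2 for i in range(n_people)]
genotypes = [
    [rng.random() < (0.6 if (v == causal and s) else 0.2) for s in is_case]
    for v in range(n_variants)
]
hits = scan(genotypes, is_case)
print(hits)
```

The point of the sketch is that the candidate list is the whole set of 
variants: the analysis is identical for every one of them, and the same 
genotypes could be rescanned against a different condition simply by swapping 
is_case.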

So perhaps the criterion is whether your questioning of the data is 
hypothesis-driven in the traditional sense.

Tim

-- 
Dr Tim Cutts
Acting Head of Scientific Computing
Wellcome Trust Sanger Institute