On 9/22/13 5:07 PM, Tom Johnson wrote:
What Analytic Methods are possible with truly Big Data
<http://www.linkedin.com/groups/What-analytic-methods-are-possible-35222.S.266561005?view=&srchtype=discussedNews&gid=35222&item=266561005&type=member&trk=eml-anet_dig-b_pd-ttl-cn&fromEmail=&ut=06_JU5ersJX5U1>
To scale a database to petabyte-scale storage and beyond, it needs to be
partitioned. Distributing data balances the I/O load across different
hardware. A given drive (in an array) can only read and write so fast,
and, worse, if the path to a drive is congested it doesn't matter how
fast it is. It seems to me the appeal of non-RDBMS technologies is that
they either 1) force a user to confront the location of data or
2) propose some gross simplification, such as assuming that fields are
not compared (column-oriented databases) -- an assumption that may or
may not be wise in the long term.
In contrast, traditional databases (DB2, Oracle, Postgres) have a
high-level and versatile means of querying data (SQL), but don't
necessarily perform well if used naively at scale. Postgres, for
example, does not give a simple unified view of N distributed
databases. It gives N databases. But if one develops a scheme to query
N databases, with appropriate logic to merge/filter the results, it
scales just fine. The tradeoff for the user is whether they prefer a
simple tool with a simple performance model, or whether they are
prepared to invest in a more flexible tool with a more challenging
performance model. The runtime cost of an SQL query can vary by many
orders of magnitude depending on how the data is indexed, whether the
planner's cost data is accurate (e.g. disk head seek time), and whether
the data is partitioned.
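A rough sketch of what such a scheme might look like, in Python -- the
shard connection strings, table, and query are hypothetical
placeholders, and a real system would fan out in parallel and push
filtering down to each shard:

import psycopg2

# Hypothetical connection strings, one per shard -- placeholders only.
SHARD_DSNS = [
    "host=shard1 dbname=mydb",
    "host=shard2 dbname=mydb",
    "host=shard3 dbname=mydb",
]

def query_all_shards(sql, params=()):
    # Run the same query on every shard and concatenate the rows.
    # Filtering/aggregation belongs in `sql` so each shard does its own
    # work and the coordinator only has to merge small result sets.
    rows = []
    for dsn in SHARD_DSNS:
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                rows.extend(cur.fetchall())
        finally:
            conn.close()
    return rows

# e.g. per-shard counts, merged by the caller:
# counts = query_all_shards("SELECT count(*) FROM events WHERE day = %s",
#                           ("2013-09-22",))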
Lisp has been blamed for decades for being slow: "Lisp programmers know
the value of everything but the cost of nothing." A more nuanced
observation is that "Bad Lisp programmers write slow programs, whereas
bad C++ programmers write no programs." The same idea applies to RDBMS
vs. Hadoop-style databases. The problem is not that one is slow, it's
that the user is either unable or unwilling to get their head around
the performance model. Some people like crude tools because they lack
the patience, opportunity, or literacy to learn more sophisticated
ones. If their use of a tool is silly, it is hard for them to recognize
their own ignorance -- it is easier to blame the tool.
Partitioning and analytics are related in ways that can be a pain. As a
simple example, consider what's needed to compute a median. If the data
is of moderate size (say tens of gigabytes), it can be pulled into
memory, sorted, and cut in half to find the median. If it is a
petabyte spread across 1000 separate database servers, then there is no
one place where it can be sorted in place; instead a merge sort is
needed. The point is that the algorithms chosen to realize various
statistical methods may have to change in order to scale at all. And
Fortran or C codes (the kind that implement statistical packages like
SAS, SPSSX, or R) don't inherently know where memory is (because the
programming language does not explicitly represent that), so the
compiler can't recognize that data tables live in different places. So,
even if the algorithms didn't need to be restructured for "big data",
the legacy numerical codes don't necessarily lend themselves to re-use
in distributed memory systems.
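A minimal sketch of that merge, assuming each server can hand a
coordinator its own shard already sorted -- the partition lists below
are toy stand-ins, and a real system would stream values rather than
hold them in memory:

import heapq

def distributed_median(sorted_partitions, total_count):
    # Merge N partitions (each already sorted on its own server) and
    # stop as soon as the midpoint of the combined ordering is reached.
    merged = heapq.merge(*sorted_partitions)
    mid = total_count // 2
    prev = None
    for i, value in enumerate(merged):
        if i == mid:
            if total_count % 2 == 1:
                return value
            return (prev + value) / 2.0
        prev = value

# Toy stand-ins for three remote partitions, each sorted locally:
parts = [[1, 4, 9], [2, 3, 8], [5, 6, 7]]
print(distributed_median(parts, 9))   # -> 5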
Anyway, this just addresses the literal question you raised: I'd say
most analytic methods are suitable for Big Data, but the techniques and
technology needed to make that routine are not yet prevalent. It's
another area where there is good, honest development work to be done,
and it just needs to be done.
Marcus
============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
to unsubscribe http://redfish.com/mailman/listinfo/friam_redfish.com