On 9/22/13 5:07 PM, Tom Johnson wrote:
What Analytic Methods are possible with truly Big Data
<http://www.linkedin.com/groups/What-analytic-methods-are-possible-35222.S.266561005?view=&srchtype=discussedNews&gid=35222&item=266561005&type=member&trk=eml-anet_dig-b_pd-ttl-cn&fromEmail=&ut=06_JU5ersJX5U1>
To scale a database to petabyte-scale storage and beyond, it needs to be
partitioned. Distributing data balances the I/O load across different
hardware. A given drive (in an array) can only read and write so fast,
and, worse, if the path to a drive is congested it doesn't matter how
fast it is. It seems to me the appeal of non-RDBMS technologies is that
they either 1) force a user to confront the location of data or
2) propose some gross simplification, such as assuming that fields are
not compared (column-oriented databases) -- an assumption that may or
may not be wise in the long term.
In contrast, traditional databases (DB2, Oracle, Postgres) have a
high-level and versatile means of querying data (SQL), but don't
necessarily perform well if used naively at scale. Postgres, for
example, does not give a simple unified view of N distributed
databases. It gives N databases. But if one develops a scheme to query
N databases, with appropriate logic to merge/filter the results, it
scales just fine. The tradeoff for the user is whether they prefer a
simple tool with a simple performance model, or whether they are
prepared to invest in a more flexible tool with a more challenging
performance model. The runtime cost of an SQL query can vary by many
orders of magnitude depending on how the data is indexed, whether the
planner's cost data is accurate (e.g. disk head seek time), and whether
the data is partitioned.
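A rough sketch of what such a scheme might look like, in Python -- the
shard connection strings, table, and query are hypothetical
placeholders, and a real system would fan out in parallel and push
filtering down to each shard:

import psycopg2

# Hypothetical connection strings, one per shard -- placeholders only.
SHARD_DSNS = [
    "host=shard1 dbname=mydb",
    "host=shard2 dbname=mydb",
    "host=shard3 dbname=mydb",
]

def query_all_shards(sql, params=()):
    # Run the same query on every shard and concatenate the rows.
    # Filtering/aggregation belongs in `sql` so each shard does its own
    # work and the coordinator only has to merge small result sets.
    rows = []
    for dsn in SHARD_DSNS:
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                rows.extend(cur.fetchall())
        finally:
            conn.close()
    return rows

# e.g. per-shard counts, merged by the caller:
# counts = query_all_shards("SELECT count(*) FROM events WHERE day = %s",
#                           ("2013-09-22",))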
Lisp has been blamed for decades for being slow: "Lisp programmers know
the value of everything but the cost of nothing." A more nuanced
observation is that "Bad Lisp programmers write slow programs, whereas
bad C++ programmers write no programs." The same idea applies to RDBMS
vs. Hadoop-style databases. The problem is not that one is slow, it's
that the user is either unable or unwilling to get their head around
the performance model. Some people like crude tools because they lack
the patience, opportunity, or literacy to learn more sophisticated
ones. If their use of a tool is silly, it is hard for them to recognize
their own ignorance -- it is easier to blame the tool.
Partitioning and analytics are related in ways that can be a pain. As a
simple example, consider what's needed to compute a median. If the data
is of moderate size (say tens of gigabytes), it can be pulled into
memory, sorted, and cut in half to find the median. If it is a
petabyte spread across 1000 separate database servers, then there is no
one place where it can be sorted in place; instead a merge sort is
needed. The point is that the algorithms chosen to realize various
statistical methods may have to change in order to scale at all. And
Fortran or C codes (the kind that implement statistical packages like
SAS, SPSSX, or R) don't inherently know where memory is (because the
programming language does not explicitly represent that), so the
compiler can't recognize that data tables live in different places. So,
even if the algorithms didn't need to be restructured for "big data",
the legacy numerical codes don't necessarily lend themselves to re-use
in distributed memory systems.
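A minimal sketch of that merge, assuming each server can hand a
coordinator its own shard already sorted -- the partition lists below
are toy stand-ins, and a real system would stream values rather than
hold them in memory:

import heapq

def distributed_median(sorted_partitions, total_count):
    # Merge N partitions (each already sorted on its own server) and
    # stop as soon as the midpoint of the combined ordering is reached.
    merged = heapq.merge(*sorted_partitions)
    mid = total_count // 2
    prev = None
    for i, value in enumerate(merged):
        if i == mid:
            if total_count % 2 == 1:
                return value
            return (prev + value) / 2.0
        prev = value

# Toy stand-ins for three remote partitions, each sorted locally:
parts = [[1, 4, 9], [2, 3, 8], [5, 6, 7]]
print(distributed_median(parts, 9))   # -> 5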
Anyway, this just addresses the literal question you raised: I'd say
most analytic methods are suitable for Big Data, but the techniques and
technology needed to make that routine are not yet prevalent. It's
another area where there is good, honest development work to be done,
and it just needs to be done.
Marcus
============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
to unsubscribe http://redfish.com/mailman/listinfo/friam_redfish.com