On Friday, 26 December 2014 at 21:31:00 UTC, aldanor wrote:
On Wednesday, 25 September 2013 at 03:41:36 UTC, Jay Norwood wrote:
I've been playing with the Python pandas package, which enables interactive manipulation of tables of data via its DataFrame structure, which they say is similar to the structures used in R.

It appears pandas has laid claim to being a faster version of R, but in doing so it is basically limited to what it can exploit by moving operations back and forth to the underlying Cython code.

Has anyone written an example app in D that manipulates dataframe type structures?

Pandas has numpy as "backend" which does a lot of heavy lifting, so first things first -- imo D needs a fast and flexible blas/lapack-compatible multi-dimensional rectangular array library that could later serve as backend for pandas-like libraries.

I don't believe I agree that we need a perfect multi-dimensional rectangular array library to serve as a backend before thinking and doing much on data frames (although it will certainly be very useful when ready).

First, it seems we do have matrices, even if they lack complete functionality for linear algebra and the like. There is a chicken-and-egg aspect to the development of tools - it is rarely the case that one kind of tool totally precedes another, and there are often complementarities and dynamic effects between the different stages. If one waits until everything one needs has been done for one, one won't get much done.

Secondly, much of the kind of thing Pandas is useful for is not exactly rocket science from a quantitative perspective, but it is just the kind of thing that is very useful if you are thinking about working with data sets of a decent size. The concepts seem to me to fit very well with std.algorithm and std.range, and can be thought of as just a way to bring out the power of the tools we already have when working with data in the world as it is. See the link below for an example of just how simple, and a rough sketch in D after it. Remember Excel pivot tables?

http://pandas.pydata.org/pandas-docs/stable/groupby.html
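
To give a flavour - a toy sketch with made-up data, not a dataframe library; the column names and figures below are invented purely for illustration - something like pandas' df.groupby("city")["sales"].sum() can already be written with chunkBy from a recent Phobos:

import std.algorithm.iteration : chunkBy, map, sum;
import std.algorithm.sorting : sort;
import std.stdio : writefln;
import std.typecons : Tuple, tuple;

void main()
{
    // Two "columns" of a toy table: a key column and a value column.
    Tuple!(string, "city", double, "sales")[] rows = [
        tuple!("city", "sales")("Berlin", 10.0),
        tuple!("city", "sales")("London", 5.0),
        tuple!("city", "sales")("Berlin", 2.5),
        tuple!("city", "sales")("London", 7.5),
    ];

    // chunkBy only groups adjacent equal keys, so sort on the key first.
    rows.sort!((a, b) => a.city < b.city);

    // Sum the sales column within each city group.
    foreach (group; rows.chunkBy!((a, b) => a.city == b.city))
        writefln("%s: %s", group.front.city, group.map!(r => r.sales).sum);
}

A real dataframe type would of course hide the sort-then-chunk step behind a groupBy method, but the building blocks are already there.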

Thirdly, one of the reasons Pandas is popular is that it is written in C/Cython and is very fast - significantly faster than Julia. One might hit roadblocks down the line when it comes to the Global Interpreter Lock and the difficulty of processing larger data sets quickly in Python, but at least this stage is fast and easy. So people do care about speed, but they also care about the frictions being taken away so that they can spend their energies on addressing the problem at hand. I.e., a dataframe will be helpful, in my view.

Processing of log data is a growing domain - partly from the internet, but also from the internet of things. See below for one company using D to process logs:

http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-ad-data-processed-daily-says-size-matters/
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

A poster on this forum is already calling D shared libraries from R (quoted below from Reddit), which brings home the point that it isn't necessary for D to be able to do every part of the process for it to take over some of the heavy work. A minimal sketch of the general mechanism follows the quote.

"[–]bachmeier 6 points 1 month ago

I call D shared libraries from R. I've put together a library that offers similar functionality to Rcpp. I've got a presentation showing its use on Linux. Both the presentation and library code should be made available within the next couple of days.

My library makes available the R API and anything in Gretl. You can allocate and manipulate R objects in D, add R assert statements in your D code, and so on. What I'm working on now is calling into GSL for optimization.

These are all mature libraries - my code is just an interface. It's generally easy to call any C library from D, and modern Fortran, which provides C interoperability, is not too much harder.
"

See here for just one use case in the internet of things. They don't use D, but maybe they should have. It shows an example where perhaps at least the log processing could easily be handled by what we already have plus a few small additional data structures - even if people use outside libraries for the machine learning part. A toy sketch of that kind of log crunching in D follows the quote below.

http://www.forbes.com/sites/danwoods/2014/11/04/how-splunk-caught-wall-streets-eye-by-taming-the-messy-world-of-iot-data/3/

"By using Splunk software, Hrebek said that his division’s leader product is able to offer customers a real-time view of operations on a train and to use machine learning to suggest optimal strategies for driving trains along various routes. Just shaving a small percentage off of fuel costs can mean huge savings for a railroad.

Why Doesn’t BI Work for the IoT?

In both of the use cases just mentioned, for years, existing business intelligence technology had been applied to the problem of making sense of the data with little success.

The problem is not that it is impossible to use traditional ETL technology and an RDBMS or, more commonly, spreadsheets to get something working so that some of the data becomes useful. It is just that the effort involved is great and the technical effort involved in maintaining such systems is massive. Hrebek compared using spreadsheets for IoT data to living in the ninth circle of hell in Dante’s Inferno, because the process is so tedious and error prone.

Machine data is different from flat files that are the paradigm for BI technology, which works in rows and columns. Also, machine data can be naturally organized into a time series, but this is not the default way that a spreadsheet or an RDBMS works.

Why Does Splunk Work for the IoT?

IoT data essentially looks exactly the same as the machine data from servers in a data center that Splunk Enterprise was initially created to handle. The software allows you to:

    Automatically parse fields
    Identify several different types of records as a related group
    Organize and store records by timestamp
    Create dashboards and analytics that are updated in real time

With each successive release, Splunk is making the process of parsing machine data as automatic and machine assisted as possible. Its software handles variations of IoT data by allowing a simple mapping of a field into a standard name. For example, the GPS coordinates of a train car might be recorded in six or seven different ways in various forms of machine data, but can be unified via Splunk Enterprise. Splunk software allows these mappings to be implemented and maintained with a minimum of effort.

The bottom line is that there is no way to avoid the imperfections that naturally occur in the real world. We are always going to have lots of trees and to have to deal with them both as individuals and as a forest, in a normalized aggregate form. The reason Splunk is making such inroads in IoT applications is that it can handle both the trees and the forest and turn the information from the real world into a clear view of what is happening that allows useful models of reality to be created. If you are building an IOT application, you must find a way to handle the messy nature of the real world."
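
For what it's worth, this is the kind of toy sketch I have in mind for the log-processing side - the log format, field names and file name below are all invented for illustration. It streams timestamped key=value records and aggregates one field per key using nothing beyond Phobos:

import std.algorithm.iteration : splitter;
import std.array : split;
import std.conv : to;
import std.stdio : File, writefln;

void main()
{
    // Running totals of fuel use per train id.
    double[string] fuelByTrain;

    // Lines are assumed to look like:
    //   2014-11-17T10:15:02 train=4711 fuel=123.4 speed=55
    foreach (line; File("machine.log").byLine)
    {
        string train;
        double fuel = 0;
        foreach (field; line.splitter(' '))
        {
            auto kv = field.split('=');
            if (kv.length != 2) continue;
            if (kv[0] == "train") train = kv[1].idup;
            else if (kv[0] == "fuel") fuel = kv[1].to!double;
        }
        if (train.length)
            fuelByTrain[train] = fuelByTrain.get(train, 0.0) + fuel;
    }

    foreach (train, total; fuelByTrain)
        writefln("train %s used %.1f units of fuel", train, total);
}

Real machine data is messier than this, which is exactly where a small set of dataframe-style data structures on top of std.algorithm would earn its keep.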

Many more similar opportunities for D here: https://www.google.de/search?q=internet+of+things+massive+log+processing+growth&btnG=Search&oe=utf-8&gws_rd=cr


Laeeth.
