On Thursday, 15 October 2015 at 21:16:18 UTC, Laeeth Isharc wrote:
On Wednesday, 14 October 2015 at 22:11:56 UTC, data pulverizer wrote:
On Tuesday, 13 October 2015 at 23:26:14 UTC, Laeeth Isharc wrote:
https://www.quora.com/Why-is-Python-so-popular-despite-being-so-slow
Andrei suggested posting more widely.

I am coming at D by way of R, C++, Python etc. so I speak as a statistician who is interested in data science applications.

Welcome...  Looks like we have similar interests.

That's good to know

To sit on the deployment side, D needs to grow its big data/NoSQL infrastructure for a start, then hook into the whole ecosystem of analytic tools in an easy and straightforward manner. This will take a lot of work!

Indeed. The dlangscience project managed by John Colvin is very interesting. It is not a pure stats project, but there will be many shared areas of need. He has some very interesting ideas, and being able to mix Python and D in a Jupyter notebook is rather nice (you can do this already).

Thanks for bringing my attention to this, this looks interesting.

Sounds interesting. Take a look at Colvin's dlangscience draft white paper, and see what you would add. It's a chance to shape things whilst they are still fluid.

Good suggestion.

3. A solid interface to a big data database that makes D data table <-> database round trips easy

Which ones do you have in mind for stats? The different choices seem to serve quite different needs. And when you say big data, how big do you typically mean?

What I mean is to start by tapping into current big data technologies. HDFS and Cassandra have C APIs which we can wrap for D.
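To give a flavour of what the wrapping involves: below is a hypothetical, minimal extern(C) binding to a few calls from the DataStax Cassandra C driver. The signatures are a sketch of its documented API rather than a checked translation of cassandra.h, so treat the details as assumptions:

// Hand-written extern(C) declarations for a few calls from the DataStax
// Cassandra C driver (cassandra.h). Signatures are a sketch of the
// documented API; check the real header before relying on them.
// Build with something like: dmd app.d -L-lcassandra
extern (C) nothrow @nogc
{
    struct CassCluster;  // opaque handles owned by the C library
    struct CassSession;
    struct CassFuture;

    CassCluster* cass_cluster_new();
    void cass_cluster_free(CassCluster* cluster);
    int cass_cluster_set_contact_points(CassCluster* cluster, const(char)* hosts);

    CassSession* cass_session_new();
    void cass_session_free(CassSession* session);
    CassFuture* cass_session_connect(CassSession* session, const(CassCluster)* cluster);

    void cass_future_wait(CassFuture* future);
    int cass_future_error_code(CassFuture* future);  // really a CassError enum
    void cass_future_free(CassFuture* future);
}

void main()
{
    import std.stdio : writeln;
    import std.string : toStringz;

    auto cluster = cass_cluster_new();
    scope (exit) cass_cluster_free(cluster);
    cass_cluster_set_contact_points(cluster, "127.0.0.1".toStringz);

    auto session = cass_session_new();
    scope (exit) cass_session_free(session);

    auto future = cass_session_connect(session, cluster);
    scope (exit) cass_future_free(future);
    cass_future_wait(future);
    writeln(cass_future_error_code(future) == 0 ? "connected" : "failed");
}

The nice part is that D consumes C ABIs directly, so a binding is mostly a matter of transcribing declarations; no glue layer in C is needed.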

4. Functional programming: especially around data table and array structures. R's apply(), lapply(), tapply(), plyr, and now data.table's DT[, , by = list()] provide powerful tools for data manipulation.

Any thoughts on what the design should look like?

Yes, I think this is easy to implement but still important. The real devil is my point #1: the dynamic data table object.
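Agreed that the apply() side falls out almost for free: std.algorithm ranges give you the lapply/sapply pattern, and a plain associative array covers tapply-style grouping. A rough sketch, with invented column data:

import std.algorithm : map;
import std.array : array;
import std.stdio : writeln;

void main()
{
    // lapply-style: map a function over a column and collect the result
    auto heights = [1.52, 1.80, 1.75, 1.63];
    auto cm = heights.map!(h => h * 100).array;

    // tapply-style: aggregate one column by another, using a plain
    // associative array as the grouping device
    auto group = ["a", "b", "a", "b"];
    auto value = [1.0, 2.0, 3.0, 4.0];
    double[string] total;
    foreach (i, g; group)
        total[g] = total.get(g, 0.0) + value[i];

    writeln(cm);     // [152, 180, 175, 163]
    writeln(total);  // ["a":4, "b":6] (AA order is unspecified)
}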


To an extent there is a balance between wanting to explore data iteratively (when you don't know where you will end up) and wanting to build a robust process for production. I have been wondering myself about using LuaJIT to strap together D building blocks for the exploration, driving it from a custom console built around Adam Ruppe's terminal.d.
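On the D side of that setup, a building block can be as small as an extern(C) function compiled into a shared library, which LuaJIT's FFI (or any other C-compatible host) can then load. A hypothetical example, with the name and layout chosen purely for illustration:

// Compile as a shared library, e.g. dmd -shared -fPIC stats.d,
// then load it from LuaJIT via ffi.load and a matching ffi.cdef.
extern (C) double mean(const(double)* xs, size_t n)
{
    double total = 0;
    foreach (i; 0 .. n)
        total += xs[i];
    return n ? total / n : double.nan;
}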

Sounds interesting

6. Nullable types make talking about missing data more straightforward and give you the opportunity to code missing values to a set value in your analysis. D is streets ahead of Python here, but this is built into R at a basic level.

So matrices with nullable types within? Is NaN enough for you? If not, it could be quite expensive if the back end is C.

I am not suggesting that we pass nullable matrices to C algorithms. Yes, NaN is how this is done in practice, but you wouldn't have NaNs in your matrix at the point of modeling: they'll just propagate and trash your answer. Nullable types are useful in data acquisition and exploration, the more practical side of data handling. I was quite shocked to see them in D, when they are essentially absent from "high level" programming languages like Python. Real data is messy, and having nullable types is useful in processing, storing and summarizing raw data. I put this in as #6 because I think it is possible to do practical statistics by working around them with notional hacks. Nullables are something that C# and R have and that Python's pandas has struggled with. The great news is that they are available in D, so we can use them.
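For anyone following along, here is a minimal sketch of that workflow with std.typecons.Nullable, using an invented column with one missing entry:

import std.typecons : Nullable, nullable;
import std.algorithm : filter, map, sum;
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // A raw column with a hole in it, as read from a messy source;
    // Nullable!double.init plays the role of R's NA
    auto col = [nullable(1.5), Nullable!double.init, nullable(2.5)];

    // Drop the missing entries, then summarise the observed values
    auto observed = col.filter!(x => !x.isNull).map!(x => x.get);
    writeln("n observed = ", observed.walkLength);  // 2
    writeln("sum        = ", observed.sum);         // 4
}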


If D can get points 1, 2, and 3, many people would be all over D, because it is a fantastic programming language and is wicked fast.
What do you like best about it? And in your own domain, what have the biggest payoffs been in practice?

I am playing with D at the moment. To become useful to me, the data table structure is a must. I previously said points 1, 2, and 3 would get data scientists sucked into D, but the data table structure is the seed: a dynamic structure like that in D would catalyze the rest. Everything else is either wrappers or routine work, maybe a lot of it, but straightforward to implement. The data table structure is, for me, the real enigma.
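Just to make the shape of the problem concrete, the crudest possible version is a map from column names to Variant-wrapped typed arrays. Everything below is illustrative rather than a proposed design:

import std.variant : Variant;
import std.stdio : writeln;

// Each column is a typed array stored behind a Variant and looked up by
// name at runtime, loosely like an R data.frame. A real design would need
// column ordering, row views, joins, and so on.
struct DataTable
{
    Variant[string] columns;

    void add(T)(string name, T[] col) { columns[name] = Variant(col); }
    T[] get(T)(string name) { return columns[name].get!(T[]); }
}

void main()
{
    DataTable dt;
    dt.add("id", [1, 2, 3]);
    dt.add("score", [0.5, 0.9, 0.7]);
    writeln(dt.get!double("score"));  // [0.5, 0.9, 0.7]
}

Even this toy version shows the tension: the table is dynamic at the column level, but you still have to name a concrete element type to get anything useful back out.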

The way that R's data types are structured around SEXPs is the key to all of this. I am currently reading through the R Internals documentation to get my head around it.

https://cran.r-project.org/doc/manuals/r-release/R-ints.html
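Stripped to its essence, a SEXP is one runtime-tagged value type that every R object flows through. That tag-plus-payload shape can be mimicked in D with std.variant.Algebraic; the payload set below is purely illustrative:

import std.variant : Algebraic;

// One tagged value type that can carry any of the supported payloads,
// loosely echoing R's SEXPTYPE tag; the payload set here is illustrative.
alias RValue = Algebraic!(double[], int[], string[], bool[]);

void main()
{
    RValue v = [1.0, 2.0, 3.0];
    assert(v.type == typeid(double[]));

    v = ["a", "b"];                     // retagged at runtime, like a SEXP
    assert(v.peek!(string[]) !is null); // typed access via the tag
}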
