On Thursday, 15 October 2015 at 07:57:51 UTC, Russel Winder wrote:
On Thu, 2015-10-15 at 06:48 +0000, data pulverizer via Digitalmars-d-learn wrote:

Just because D doesn't have this now doesn't mean it cannot. C doesn't have such a capability, but R and Python do, even though R and CPython are themselves just C code.

I think the way R does this is that its dynamic runtime environment is used to bind together native C arrays of basic types. I wonder if we could simulate that dynamic behaviour by leveraging D's short compilation times to write/update data-table source file(s) describing the structure of new or modified data tables on the fly?
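For instance, instead of writing new source files and recompiling at run time, D can already generate a typed column-store table at compile time from a schema, via CTFE and a string mixin. A minimal sketch (DataTable and makeColumns are hypothetical names of mine, not an existing library):

import std.stdio;

/// CTFE helper: turn parallel name/type lists into column declarations.
string makeColumns(string[] names, string[] types)
{
    string code;
    foreach (i, name; names)
        code ~= types[i] ~ "[] " ~ name ~ ";\n";
    return code;
}

/// A toy column-store table whose fields are generated at compile time.
struct DataTable
{
    // Expands to: long[] id; double[] score; string[] label;
    mixin(makeColumns(["id", "score", "label"],
                      ["long", "double", "string"]));
}

void main()
{
    DataTable t;
    t.id    = [1, 2, 3];
    t.score = [0.5, 1.5, 2.5];
    t.label = ["a", "b", "c"];
    writeln(t.id, " ", t.score, " ", t.label);
}

A CTFE helper like this could in principle be fed a schema produced at build time, which gets part of the way to R's dynamic tables without invoking the compiler at run time.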

Pandas data structures rely on the NumPy n-dimensional array implementation; it is not beyond the bounds of possibility that that data structure could be realized as a D module.
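At its core a NumPy-style array is just a flat buffer plus shape and stride metadata, so a starting point is small. A minimal sketch, assuming a hypothetical NDArray type (a real module would need slicing, views, broadcasting, and much more):

import std.stdio;

/// A NumPy-style n-dimensional array: flat storage plus
/// shape and row-major strides.
struct NDArray(T)
{
    T[] data;
    size_t[] shape;
    size_t[] strides;

    this(size_t[] shape...)
    {
        this.shape = shape.dup;
        strides = new size_t[](shape.length);
        size_t s = 1;
        foreach_reverse (i; 0 .. shape.length)
        {
            strides[i] = s;
            s *= shape[i];
        }
        data = new T[](s);
    }

    /// Index with one coordinate per dimension.
    ref T opIndex(size_t[] idx...)
    {
        size_t offset = 0;
        foreach (i, ix; idx)
            offset += ix * strides[i];
        return data[offset];
    }
}

void main()
{
    auto a = NDArray!double(2, 3);
    a[1, 2] = 42.0;
    writeln(a.data); // [0, 0, 0, 0, 0, 42]
}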

Julia's DArray object is an interesting take on this: https://github.com/JuliaParallel/DistributedArrays.jl

I believe that parallelism on arrays and parallelism on data tables are different challenges. Data tables are easier since we can parallelise by row, hence the preference for row-based tuples.
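With row-based tuples, std.parallelism already makes per-row parallelism cheap, since each row can be processed independently. A minimal sketch (the Row layout is just for illustration):

import std.parallelism : parallel;
import std.stdio;
import std.typecons : Tuple;

alias Row = Tuple!(int, "id", double, "score");

void main()
{
    auto rows = new Row[](1_000);
    foreach (i, ref r; rows)
    {
        r.id = cast(int) i;
        r.score = i * 0.5;
    }

    auto doubled = new double[](rows.length);
    // Rows are independent, so the loop body can run on any worker thread.
    foreach (i, row; parallel(rows))
        doubled[i] = row.score * 2.0;

    writeln(doubled[0 .. 5]);
}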

The core issue is to have a seriously efficient n-dimensional array that is amenable to data parallelism and is extensible. As far as I am currently aware (I will investigate more), the NumPy array is a good native-code array but has some issues with data parallelism, and Pandas has to do quite a lot of work to get the extensibility. I wonder how the R data.table works.

R's data.table is not currently parallelised.
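On the data-parallelism half of that wish list, D's std.parallelism can already map a function across a plain native array on all cores. A minimal sketch, not a full n-dimensional design:

import std.array : array;
import std.math : sqrt;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio;

void main()
{
    auto xs = iota(10_000_000).array;   // a plain native int[]
    // amap evaluates the function across the task pool's worker
    // threads and returns a newly allocated double[].
    auto ys = taskPool.amap!(x => sqrt(cast(double) x))(xs);
    writeln(ys[0 .. 5]);
}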

I have this nagging feeling that, like NumPy, data.table seems a lot better than it actually is. From small experiments, D (and Chapel even more so) is hugely faster than Python/NumPy at things Python people think NumPy is brilliant for. Expectations of Python programmers are set by the scale of Python performance, so NumPy seems brilliant. Compared to the scale set by D and Chapel, NumPy is very disappointing. I bet the same is true of R (I have never really used R).
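For concreteness, the kind of micro-benchmark behind such comparisons might look like this in D: an element-wise multiply over large arrays, expressed as a single native array operation (a rough sketch; timings vary by machine and compiler flags):

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio;

void main()
{
    enum n = 50_000_000;
    auto a = new double[](n);
    auto b = new double[](n);
    auto c = new double[](n);
    a[] = 1.5;
    b[] = 2.5;

    auto sw = StopWatch(AutoStart.yes);
    c[] = a[] * b[];   // array-wise op: one tight native loop, no interpreter
    sw.stop();
    writeln("elementwise multiply: ", sw.peek.total!"msecs", " ms");
}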

Thanks for notifying me about Chapel - something else interesting to investigate. When it comes to speed, R is very strange. Basic math operations (e.g. *, +, /) on an R array can be fast, but explicit for loops can be hundreds of times slower - most things are slow in R unless they are baked directly into its base operations. You can, however, write code in C or C++ and call it very easily from R using the Rcpp interface.

This is therefore an opportunity for D to step in. However, it is a journey of a thousand miles to get something production-worthy. Python/NumPy/Pandas have had a very large number of programmer hours expended on them. Doing this poorly as a D module is likely worse than not doing it at all.

I think D has a lot to offer the world of data science.
