25-Nov-2014 00:34, weaselcat wrote:
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
Just browsing reddit and found this article posted about D.
Written by Andrew Pascoe of AdRoll.

From the article:
"The D programming language has quickly become our language of choice
on the Data Science team for any task that requires efficiency, and is
now the keystone language for our critical infrastructure. Why?
Because D has a lot to offer."

Article:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html


Quoting the article:

> One of the best things we can do is minimize the amount of memory we’re allocating; we allocate a new char[] every time we read a line.

This is wrong: byLine reuses its buffer when the element type is mutable, which is the case with char[]. I recommend authors always double-check a hypothesis before stating it in an article, especially one about performance.

Observe:
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1660
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1652

And notice a warning about reusing the buffer here:

https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1741
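To illustrate the point (a minimal sketch, with a hypothetical file name): because byLine reuses its internal buffer, each line is only valid until the next iteration, and you must copy it explicitly if you want to keep it around.

```d
import std.stdio;

void main()
{
    string[] kept;
    // byLine yields slices of one reused internal buffer, so the
    // contents of `line` are overwritten on every iteration.
    foreach (char[] line; File("data.txt").byLine)
        kept ~= line.idup; // copy only when you need to retain it
}
```

So a new char[] is *not* allocated per line; at most, the one buffer grows to fit the longest line seen so far.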

Reddit:
http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/


> Why is File.byLine so slow?

Seems to have been mostly fixed some time ago. It's slower than straight fgets, but it's not that bad.

Also, a nearly optimal solution using C's fgets with a growable buffer is far simpler than the code outlined in the article. Or we could mmap the file instead.
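A sketch of what that fgets-with-growable-buffer approach might look like (my assumption of the intent, not code from the article): grow a single buffer until the whole line fits, amortizing allocations across all lines.

```d
import core.stdc.stdio : FILE, fgets;
import core.stdc.string : strlen;

// Reads one line into `buf`, growing it as needed.
// Returns a slice of `buf` holding the line (empty at EOF).
char[] readLine(FILE* f, ref char[] buf)
{
    size_t len = 0;
    for (;;)
    {
        if (buf.length - len < 2)
            buf.length = buf.length ? buf.length * 2 : 256; // doubling => amortized O(1)
        if (fgets(buf.ptr + len, cast(int)(buf.length - len), f) is null)
            break; // EOF or error
        len += strlen(buf.ptr + len);
        if (len && buf[len - 1] == '\n')
            break; // got a complete line
    }
    return buf[0 .. len];
}
```

One buffer, reused for every line, grown by doubling: no per-line allocation and no juggling of multiple buffers.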

> Having to work around the standard library
> defeats the point of a standard library.

Truth be told, most of the slowdown should be in the eager split, notably the GC allocation per line. It may also trigger a GC collection after splitting many lines, maybe even many collections.

The easy way out is to use the standard _splitter_, which is lazy and non-allocating. That's a _2-letter_ change, and it still uses a nice, clean standard function.
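Concretely, the change is just this (field separator and names assumed for illustration):

```d
import std.algorithm : splitter;
import std.array : split;

void process(char[] line)
{
    // Eager: allocates a new array of slices for every line,
    // feeding the GC and risking collections on large inputs.
    auto fields = line.split('\t');

    // Lazy: yields the same slices one at a time, in place,
    // with no per-line allocation.
    foreach (field; line.splitter('\t'))
    {
        // use `field` here
    }
}
```

Same slices, same loop body; only the allocation per line disappears.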

The article was really disappointing for me because I expected to see the single-line change outlined above fix 80% of the problem elegantly. Instead I see 100+ spooky lines that needlessly maintain 3 buffers at the same time (how scientific) instead of growing a single one to amortize the cost. And then a claim that it's nice to be able to improve speed so easily.


--
Dmitry Olshansky
