25-Nov-2014 00:34, weaselcat wrote:
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
Just browsing reddit and found this article posted about D.
Written by Andrew Pascoe of AdRoll.

From the article:
"The D programming language has quickly become our language of choice
on the Data Science team for any task that requires efficiency, and is
now the keystone language for our critical infrastructure. Why?
Because D has a lot to offer."

Article:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html


Quoting the article:

> One of the best things we can do is minimize the amount of memory we’re allocating; we allocate a new char[] every time we read a line.

This is wrong: byLine reuses its buffer when the element type is mutable, which is the case with char[]. I recommend authors always double-check a hypothesis before stating it in an article, especially one about performance.

Observe:
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1660
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1652

And notice a warning about reusing the buffer here:

https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1741
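To illustrate the point (a minimal sketch, with a hypothetical file name): because byLine reuses its internal buffer, each line is only valid until the next iteration, and you must copy it explicitly if you want to keep it around.

```d
import std.stdio;

void main()
{
    string[] kept;
    // byLine yields slices of one reused internal buffer, so the
    // contents of `line` are overwritten on every iteration.
    foreach (char[] line; File("data.txt").byLine)
        kept ~= line.idup; // copy only when you need to retain it
}
```

So a new char[] is *not* allocated per line; at most, the one buffer grows to fit the longest line seen so far.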

Reddit:
http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/


> Why is File.byLine so slow?

Seems to have been mostly fixed some time ago. It's slower than straight fgets, but it's not that bad.

Also, a nearly optimal solution using C's fgets with a growable buffer is far simpler than the code outlined in the article. Or we could mmap the file instead.
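A sketch of what that fgets-with-growable-buffer approach might look like (my assumption of the intent, not code from the article): grow a single buffer until the whole line fits, amortizing allocations across all lines.

```d
import core.stdc.stdio : FILE, fgets;
import core.stdc.string : strlen;

// Reads one line into `buf`, growing it as needed.
// Returns a slice of `buf` holding the line (empty at EOF).
char[] readLine(FILE* f, ref char[] buf)
{
    size_t len = 0;
    for (;;)
    {
        if (buf.length - len < 2)
            buf.length = buf.length ? buf.length * 2 : 256; // doubling => amortized O(1)
        if (fgets(buf.ptr + len, cast(int)(buf.length - len), f) is null)
            break; // EOF or error
        len += strlen(buf.ptr + len);
        if (len && buf[len - 1] == '\n')
            break; // got a complete line
    }
    return buf[0 .. len];
}
```

One buffer, reused for every line, grown by doubling: no per-line allocation and no juggling of multiple buffers.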

> Having to work around the standard library
> defeats the point of a standard library.

Truth be told, most of the slowdown should be in the eager split, notably the GC allocation per line. It may also trigger a GC collection after splitting many lines, maybe even many collections.

The easy way out is to use the standard _splitter_, which is lazy and non-allocating. That's a _2-letter_ change, and it still uses a nice, clean standard function.
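Concretely, the change is just this (field separator and names assumed for illustration):

```d
import std.algorithm : splitter;
import std.array : split;

void process(char[] line)
{
    // Eager: allocates a new array of slices for every line,
    // feeding the GC and risking collections on large inputs.
    auto fields = line.split('\t');

    // Lazy: yields the same slices one at a time, in place,
    // with no per-line allocation.
    foreach (field; line.splitter('\t'))
    {
        // use `field` here
    }
}
```

Same slices, same loop body; only the allocation per line disappears.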

The article was really disappointing for me because I expected to see the single-line change outlined above fix 80% of the problem elegantly. Instead I see 100+ spooky lines that needlessly maintain 3 buffers at the same time (how scientific) instead of growing a single one to amortize the cost. And then a claim that it's nice to be able to improve speed so easily.


--
Dmitry Olshansky
