Re: to compose or hack?

2021-07-06 Thread Jon Degenhardt via Digitalmars-d-learn
On Wednesday, 7 July 2021 at 01:44:20 UTC, Steven Schveighoffer 
wrote:
This is pretty minimal, but does what I want it to do. Is it 
ready for inclusion in Phobos? Not by a longshot! A truly 
generic interleave would properly forward everything else that 
the range supports (like `length`, `save`, etc).


But it got me thinking, how often do people roll their own vs. 
trying to compose using existing Phobos nuggets? I found this 
pretty satisfying, even if I didn't test it to death and maybe 
I use it only in one place. Do you find it difficult to use 
Phobos in a lot of situations to compose your specialized 
ranges?


I try to compose using existing Phobos facilities, but don't 
hesitate to write my own ranges. The reasons are usually along 
the lines you describe.


For one, range creation is easy in D, consistent with the pro/con 
tradeoffs described in the thread/talk [Iterator and Ranges: 
Comparing C++ to D to 
Rust](https://forum.dlang.org/thread/diexjstekiyzgxlic...@forum.dlang.org).

Another is that if application/task-specific logic is involved, it is often simpler/faster to just incorporate it into the range rather than figure out how to factor it out of the more general range, especially if the range is not going to be used much.
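A hand-rolled range of the kind discussed can be quite small. The following is a hypothetical minimal interleave sketch (not the code from the original post); a Phobos-quality version would also forward `length`, `save`, and the rest of the range interface:

```d
import std.range.primitives : isInputRange;

/// Minimal interleave: alternates elements from two ranges.
/// A sketch only; a truly generic version would forward length, save, etc.
struct Interleave(R1, R2)
    if (isInputRange!R1 && isInputRange!R2)
{
    R1 r1;
    R2 r2;
    bool useFirst = true;

    // Empty when the side we would read next is exhausted.
    @property bool empty() { return useFirst ? r1.empty : r2.empty; }
    @property auto front() { return useFirst ? r1.front : r2.front; }
    void popFront()
    {
        if (useFirst) r1.popFront();
        else r2.popFront();
        useFirst = !useFirst;
    }
}

auto interleave(R1, R2)(R1 r1, R2 r2)
{
    return Interleave!(R1, R2)(r1, r2);
}

unittest
{
    import std.algorithm.comparison : equal;
    assert([1, 3, 5].interleave([2, 4, 6]).equal([1, 2, 3, 4, 5, 6]));
}
```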


--Jon



Re: Need for speed

2021-04-01 Thread Jon Degenhardt via Digitalmars-d-learn

On Thursday, 1 April 2021 at 19:55:05 UTC, H. S. Teoh wrote:
On Thu, Apr 01, 2021 at 07:25:53PM +, matheus via 
Digitalmars-d-learn wrote: [...]
Since this is the "Learn" part of the forum, be careful with 
"-boundscheck=off".


I mean, for this little snippet it's OK, but for other projects 
this may be wrong, and as it says here: 
https://dlang.org/dmd-windows.html#switch-boundscheck


"This option should be used with caution and as a last resort 
to improve performance. Confirm turning off @safe bounds 
checks is worthwhile by benchmarking."

[...]

It's interesting that whenever a question about D's performance 
pops up in the forums, people tend to reach for optimization 
flags.  I wouldn't say it doesn't help; but I've found that 
significant performance improvements can usually be obtained by 
examining the code first, and catching common newbie mistakes.  
Those usually account for the majority of the observed 
performance degradation.


Only after the code has been cleaned up and obvious mistakes 
fixed, is it worth reaching for optimization flags, IMO.


This is my experience as well, and not just for D. Pick good 
algorithms and pay attention to memory allocation. Don't go crazy 
on the latter. Many people try to avoid GC at all costs, but I 
don't usually find it necessary to go quite that far. Very often 
simply reusing already allocated memory does the trick. The blog 
post I wrote a few years ago focuses on these ideas: 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/
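The memory-reuse idea can be shown with a small generic sketch (assumptions: the input file name is hypothetical; this is not code from the blog post). `byLine` reuses an internal buffer per line, so the only per-line allocation is the `idup` for keys we actually keep:

```d
import std.stdio;

void main()
{
    auto file = File("data.txt");  // hypothetical input file
    size_t[string] counts;

    // byLine reuses an internal buffer for each line, avoiding a GC
    // allocation per line. Copy (idup) only the lines kept as keys.
    foreach (line; file.byLine)
    {
        if (auto p = line in counts) (*p)++;
        else counts[line.idup] = 1;  // allocate only for new keys
    }

    writeln(counts.length, " distinct lines");
}
```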


--Jon




Re: Silicon Valley D Meetup - March 18, 2021 - "Templates in the D Programming Language" by Ali Çehreli

2021-03-21 Thread Jon Degenhardt via Digitalmars-d-announce

On Friday, 19 March 2021 at 17:10:27 UTC, Ali Çehreli wrote:

Jon mentioned how PR 7678 reduced the performance of 
std.regex.matchOnce. After analyzing the code we realized that 
the performance loss must be due to two delegate context 
allocations:


https://github.com/dlang/phobos/pull/7678/files#diff-269abc020de3a951eaaa5b8eca5a0700ba8b298767c7a64f459e74e1531a80aeR825

One delegate is 'matchOnceImp' and the other one is the 
anonymous delegate created on the return expression.


We understood that 'matchOnceImp' could not be a nested 
function because of an otherwise useful rule: the name of the 
nested function alone would *call* that function instead of 
being a symbol for it. That is not the case for a local 
delegate variable, so that's why 'matchOnceImp' exists as a 
delegate variable there.


Then there is the addition of the 'pure' attribute to it. 
Fine...


After tinkering with the code, we realized that the same effect 
can be achieved with a static member function of a static 
struct, which would not allocate any delegate context. I added 
@nogc to the following code to prove that point. The following 
code is even simpler than what Jon and I came up with yesterday.


[... Code snippet removed ...]

There: we injected @trusted code inside a @nogc @safe function.

Question to others: Did we understand the reason for the 
convoluted code in that PR fully? Is the above method really a 
better solution?


I submitted PR 7902 (https://github.com/dlang/phobos/pull/7902) 
to address this. I wasn't able to use the version Ali showed in 
the post, but the PR does use what is essentially the same idea 
identified at the D Meetup. It addresses a performance 
regression, and is a bit more nuanced than would be ideal. 
Comments and review would be appreciated.
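The underlying technique (a static member function of a static struct, replacing a delegate variable) can be illustrated in isolation. This is a hypothetical sketch, not the removed snippet or the PR code:

```d
@safe @nogc unittest
{
    // A nested struct marked static has no context pointer, and a
    // static member function allocates no delegate context, so this
    // compiles under @nogc.
    static struct Impl
    {
        static int doubled(int x) pure @safe @nogc nothrow
        {
            return x + x;
        }
    }

    assert(Impl.doubled(21) == 42);

    // By contrast, a delegate closing over locals would need a
    // GC-allocated context, which @nogc rejects.
}
```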


--Jon




Re: Trying to reduce memory usage

2021-02-22 Thread Jon Degenhardt via Digitalmars-d-learn

On Tuesday, 23 February 2021 at 00:08:40 UTC, tsbockman wrote:
On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt 
wrote:
It would be interesting to see how the performance compares to 
tsv-uniq 
(https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The 
prebuilt binaries turn on all the optimizations 
(https://github.com/eBay/tsv-utils/releases).


My program (called line-dedup below) is modestly faster than 
yours, with the gap gradually widening as files get bigger. 
Similarly, when not using a memory-mapped scratch file, my 
program is modestly less memory hungry than yours, with the gap 
gradually widening as files get bigger.


In neither case is the difference very exciting though; the 
real benefit of my algorithm is that it can process files too 
large for physical memory. It might also handle frequent hash 
collisions better, and could be upgraded to handle huge numbers 
of very short lines efficiently.


Thanks for running the comparison! I appreciate seeing how other 
implementations compare.


I'd characterize the results a bit differently though. Based on 
the 
numbers, line-dedup is materially faster than tsv-uniq, at least 
on the tests run. To your point, it may not make much practical 
difference on data sets that fit in memory. tsv-uniq is fast 
enough for most needs. But it's still a material performance 
delta. Nice job!


I agree also that the bigger pragmatic benefit is fast processing 
of files much larger than will fit in memory. There are other 
useful problems like this. One I often need is creating a random 
weighted ordering. Easy to do for data sets that fit in memory, 
but hard to do fast for data sets that do not.


--Jon




Re: Trying to reduce memory usage

2021-02-18 Thread Jon Degenhardt via Digitalmars-d-learn

On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
I spent some time experimenting with this problem, and here is 
the best solution I found, assuming that perfect de-duplication 
is required. (I'll put the code up on GitHub / dub if anyone 
wants to have a look.)


It would be interesting to see how the performance compares to 
tsv-uniq 
(https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The 
prebuilt binaries turn on all the optimizations 
(https://github.com/eBay/tsv-utils/releases).


tsv-uniq wasn't included in the different comparative benchmarks 
I published, but I did run my own benchmarks and it holds up 
well. However, it should not be hard to beat it. What might be 
more interesting is what the delta is.


tsv-uniq uses the most straightforward approach: popping things 
into an associative array, with no custom data structures. Enough 
memory is required to hold all the unique keys in memory, so it 
won't handle arbitrarily large data sets. It would be interesting 
to see how the straightforward approach compares with the more 
highly tuned approach.
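The straightforward approach described, as a generic sketch (not the actual tsv-uniq code):

```d
import std.stdio;

void main()
{
    bool[string] seen;  // all unique keys must fit in memory

    foreach (line; stdin.byLine)
    {
        if (line in seen) continue;  // duplicate: skip
        seen[line.idup] = true;      // copy byLine's reused buffer
        writeln(line);               // first occurrence: emit
    }
}
```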


--Jon



Re: Article: Why I use the D programming language for scripting

2021-01-31 Thread Jon Degenhardt via Digitalmars-d-announce

On Sunday, 31 January 2021 at 20:36:43 UTC, aberba wrote:

It's finally out!

https://opensource.com/article/21/1/d-scripting


Very nice! Clearly I'm not taking enough advantage of scripting 
capabilities!


--Jon


Re: std.algorithm.splitter on a string not always bidirectional

2021-01-22 Thread Jon Degenhardt via Digitalmars-d-learn
On Friday, 22 January 2021 at 17:29:08 UTC, Steven Schveighoffer 
wrote:

On 1/22/21 11:57 AM, Jon Degenhardt wrote:


I think the idea is that if a construct like 
'xyz.splitter(args)' produces a range with the sequence of 
elements {"a", "bc", "def"}, then 'xyz.splitter(args).back' 
should produce "def". But, if finding the split points 
starting from the back results in something like {"f", "de", 
"abc"} then that relationship hasn't held, and the results are 
unexpected.


But that is possible with all 3 splitter variants. Why is one 
allowed to be bidirectional and the others are not?


I'm not defending it, just explaining what I believe the thinking 
was based on the examination I did. It wasn't just looking at the 
code, there was a discussion somewhere. A forum discussion, PR 
discussion, bug or code comments. Something somewhere, but I 
don't remember exactly.


However, to answer your question - The relationship described is 
guaranteed if the basis for the split is a single element. If the 
range is a string, that's a single 'char'. If the range is 
composed of integers, then a single integer. Note that if the 
basis for the split is itself a range, then the relationship 
described is not guaranteed.


Personally, I can see a good argument that bidirectionality 
should not be supported in any of these cases, and instead force 
the user to choose between eager splitting or reversing the range 
via retro. For the common case of strings, the further argument 
could be made that the distinction between char and dchar is 
another point of inconsistency.
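The two user-side alternatives mentioned (eager splitting, or reversing via retro) look like this; a small sketch of the idea, not a proposed API:

```d
unittest
{
    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : splitter;
    import std.array : split;
    import std.range : retro;

    // Eager split materializes all pieces, so back-indexing is safe.
    auto pieces = "a|b|c".split('|');
    assert(pieces[$ - 1] == "c");

    // Or reverse the range first and split the reversed text. Note
    // that each piece also comes out reversed; it's invisible here
    // only because each piece is a single character.
    auto fromBack = "a|b|c".retro.splitter('|');
    assert(fromBack.front.equal("c"));
}
```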


Regardless of whether the choices made were the best choices, there 
was some thinking that went into it, and it is worth 
understanding the thinking when considering changes.


--Jon



Re: std.algorithm.splitter on a string not always bidirectional

2021-01-22 Thread Jon Degenhardt via Digitalmars-d-learn
On Friday, 22 January 2021 at 14:14:50 UTC, Steven Schveighoffer 
wrote:

On 1/22/21 12:55 AM, Jon Degenhardt wrote:
On Friday, 22 January 2021 at 05:51:38 UTC, Jon Degenhardt 
wrote:
On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
Schveighoffer wrote:

auto sp1 = "a|b|c".splitter('|');

writeln(sp1.back); // ok

auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

writeln(sp2.back); // error, not bidirectional

Why? is it an oversight, or is there a good reason for it?



I believe the reason is two-fold. First, splitter is lazy. 
Second, the range splitting is defined in the forward 
direction, not the reverse direction. A bidirectional range 
is only supported if it is guaranteed that the splits will 
occur at the same points in the range when run in either 
direction. That's why the single element delimiter is 
supported. It's clearly the case for the predicate function in 
your example. If that's known to be always true then perhaps 
it would make sense to enhance splitter to generate 
bidirectional results in this case.




Note that the predicate might use a random number generator to 
pick the split points. Even for the same sequence of random 
numbers, the split points would be different if run from the 
front than if run from the back.


I think this isn't a good explanation.

All forms of splitter accept a predicate (including the one 
which supports a bi-directional result). Many other phobos 
algorithms that accept a predicate provide bidirectional 
support. The splitter result is also a forward range (which 
makes no sense in the context of random splits).


Finally, I'd suggest that even if you split based on a subrange 
that is also bidirectional, it doesn't make sense that you 
couldn't split backwards based on that. Common sense says a 
range split on substrings is the same whether you split it 
forwards or backwards.


I can do this too (and in fact I will, because it works, even 
though it's horrifically ugly):


auto sp3 = "a.b|c".splitter!((c, unused) => 
!isAlphaNum(c))('?');


writeln(sp3.back); // ok

Looking at the code, it looks like the first form of splitter 
uses a different result struct than the other two (which have a 
common implementation). It just needs cleanup.


-Steve


I think the idea is that if a construct like 'xyz.splitter(args)' 
produces a range with the sequence of elements {"a", "bc", 
"def"}, then 'xyz.splitter(args).back' should produce "def". But, 
if finding the split points starting from the back results in 
something like {"f", "de", "abc"} then that relationship hasn't 
held, and the results are unexpected.


Note that in the above example, 'xyz.retro.splitter(args)' might 
produce {"f", "ed", "cba"}, so again not the same.


Another way to look at it: if split (eager) took a predicate, 
then 'xyz.splitter(args).back' and 'xyz.split(args).back' should 
produce the same result. But they will not with the example given.


I believe these consistency issues are the reason why the 
bidirectional support is limited.


Note: I didn't design any of this, but I did redo the examples in 
the documentation at one point, which is why I looked at this.


--Jon


Re: std.algorithm.splitter on a string not always bidirectional

2021-01-21 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 22 January 2021 at 05:51:38 UTC, Jon Degenhardt wrote:
On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
Schveighoffer wrote:

auto sp1 = "a|b|c".splitter('|');

writeln(sp1.back); // ok

auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

writeln(sp2.back); // error, not bidirectional

Why? is it an oversight, or is there a good reason for it?

-Steve


I believe the reason is two-fold. First, splitter is lazy. 
Second, the range splitting is defined in the forward 
direction, not the reverse direction. A bidirectional range is 
only supported if it is guaranteed that the splits will occur 
at the same points in the range when run in either direction. 
That's why the single element delimiter is supported. It's 
clearly the case for the predicate function in your example. If 
that's known to be always true then perhaps it would make sense 
to enhance splitter to generate bidirectional results in this 
case.


--Jon


Note that the predicate might use a random number generator to 
pick the split points. Even for the same sequence of random numbers, 
the split points would be different if run from the front than if 
run from the back.


Re: std.algorithm.splitter on a string not always bidirectional

2021-01-21 Thread Jon Degenhardt via Digitalmars-d-learn
On Thursday, 21 January 2021 at 22:43:37 UTC, Steven 
Schveighoffer wrote:

auto sp1 = "a|b|c".splitter('|');

writeln(sp1.back); // ok

auto sp2 = "a.b|c".splitter!(v => !isAlphaNum(v));

writeln(sp2.back); // error, not bidirectional

Why? is it an oversight, or is there a good reason for it?

-Steve


I believe the reason is two-fold. First, splitter is lazy. 
Second, the range splitting is defined in the forward direction, 
not the reverse direction. A bidirectional range is only 
supported if it is guaranteed that the splits will occur at the 
same points in the range when run in either direction. That's why 
the single element delimiter is supported. It's clearly the case 
for the predicate function in your example. If that's known to be 
always true then perhaps it would make sense to enhance splitter 
to generate bidirectional results in this case.


--Jon


Re: Github Actions now support D out of the box!

2020-10-04 Thread Jon Degenhardt via Digitalmars-d-announce

On Friday, 21 August 2020 at 02:03:40 UTC, Mathias LANG wrote:

Hi everyone,
Almost a year ago, Ernesto Castelloti (@ErnyTech) submitted a 
PR for Github's "starter-workflow" to add support for D out of 
the box (https://github.com/actions/starter-workflows/pull/74). 
It was in a grey area for a while, as Github was trying to come 
up with a policy for external actions. I ended up picking up 
the project, after working with actions extensively for my own 
projects and the dlang org, and my PR was finally merged 
yesterday 
(https://github.com/actions/starter-workflows/pull/546).


A thank you to everyone who helped put this together. I just 
started using it, and it works quite well. It's a very valuable 
tool to have!


--Jon



Re: Why is BOM required to use unicode in tokens?

2020-09-15 Thread Jon Degenhardt via Digitalmars-d-learn
On Tuesday, 15 September 2020 at 14:59:03 UTC, Steven 
Schveighoffer wrote:

On 9/15/20 10:18 AM, James Blachly wrote:
What will it take (i.e. order of difficulty) to get this fixed 
-- will merely a bug report (and PR, not sure if I can tackle 
or not) do it, or will this require more in-depth discussion 
with compiler maintainers?


I'm thinking your issue will not be fixed (just like we don't 
allow $abc to be an identifier). But the spec can be fixed to 
refer to the correct standards.


Looks like it has to do with the '∂' character. But non-ASCII 
alphabetic characters generally work.


# The 'Ш' and 'ä' characters are fine.
$ echo $'import std.stdio; void Шä() { writeln("Hello World!"); } 
void main() { Шä(); }' | dmd -run -

Hello World!

# But not '∂'
$ echo $'import std.stdio; void x∂() { writeln("Hello World!"); } 
void main() { x∂(); }' | dmd -run -

__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token

However, while 'Ш' and 'ä' satisfy the definition of a Unicode 
letter, '∂' does not (using D's current Unicode definitions). 
I'll use tsv-filter (from tsv-utils) to show this rather than 
writing out the full D code; under the covers it uses 
std.regex.matchFirst().


# The input
$ echo $'x\n∂\nШ\nä'
x
∂
Ш
ä

# The input filtered by Unicode letter '\p{L}'
$ echo $'x\n∂\nШ\nä' | tsv-filter --regex 1:'^\p{L}$'
x
Ш
ä

The spec can be made clearer and more correct. But if a 
"universal alpha" is essentially a Unicode letter, then allowing 
the symbol you chose would require a change in the spec.


--Jon


Re: Why is BOM required to use unicode in tokens?

2020-09-15 Thread Jon Degenhardt via Digitalmars-d-learn

On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly 
wrote:
I wish to write a function including ∂x and ∂y (these are 
trivial to type with appropriate keyboard shortcuts - alt+d on 
Mac), but without a unicode byte order mark at the beginning 
of the file, the lexer rejects the tokens.


It is not apparently easy to insert such marks (AFAICT no 
common tool does this specifically), while other languages 
work fine (i.e., accept unicode in their source) without it.


Is there a downside to at least presuming UTF-8?


According to the spec [1] this should Just Work. I'd recommend 
filing a bug.


[1] https://dlang.org/spec/lex.html#source_text


Under the identifiers section 
(https://dlang.org/spec/lex.html#identifiers) it describes 
identifiers as:


Identifiers start with a letter, _, or universal alpha, and are 
followed by any number of letters, _, digits, or universal 
alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) 
Appendix D of the C99 Standard.


I was unable to find the definition of a "universal alpha", or 
whether it includes non-ASCII alphabetic characters.


Re: Install multiple executables with DUB

2020-09-04 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 4 September 2020 at 07:27:33 UTC, glis-glis wrote:
On Thursday, 3 September 2020 at 14:34:48 UTC, Jacob Carlborg 
wrote:
Oh, multiple binaries, I missed that. You can try to add 
multiple configurations [1]. Or if you have executables 
depending on only one source file, you can use single-file 
packages [2].


Thanks, but this still means I would have to write an 
install-script running


`dub build --single`

on each script, right?
I looked at tsv-utils [1] which seems to be a similar use-case 
as mine, and they declare each tool as a subpackage. The main 
package runs a d-file called `dub_build.d` which compiles all 
subpackages. Feels like overkill to me; I'll probably just 
stick to a makefile.



[1] 
https://github.com/eBay/tsv-utils/blob/master/docs/AboutTheCode.md#building-and-makefile


The `dub_build.d` is so that people can use `$ dub fetch` to 
download and build the tools with `$ dub run`, from 
code.dlang.org. dub fetch/run is the typical dub sequence. But 
it's awkward. And it's geared toward users who have a D compiler 
plus dub already installed. For building your own binaries you 
might as well use `make`. However, if you decide to add your 
tools to the public dub package registry you might consider the 
technique.


My understanding is that the dub developers recognize that 
multiple binaries are inconvenient at present and have ideas on 
improvements. Having a few more concrete use cases might help 
nail down the requirements.


The tsv-utils directory layout may be worth a look. It's been 
pretty successful for multiple binaries in a single repo with 
some shared code. (Different folks made suggestions leading to 
this structure.) It works for both make and dub, and works well 
with other tools, like dlpdocs (Adam Ruppe's doc generator). The 
tsv-utils `make` setup is quite messy at this point; you can 
probably do quite a bit better.
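A subpackage-based layout like the one tsv-utils uses can be sketched in a top-level `dub.sdl` (tool names here are hypothetical, not the tsv-utils package file):

```sdl
name "mytools"
description "Multiple command-line tools in one repo"

# Each tool is a subpackage with its own source directory.
subPackage "./tool-a/"
subPackage "./tool-b/"

# Shared code can live in a common subpackage the tools depend on.
subPackage "./common/"
```

Each binary can then be built individually with `dub build :tool-a` from the repo root.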


--Jon


Re: How to get the element type of an array?

2020-08-25 Thread Jon Degenhardt via Digitalmars-d-learn

On Tuesday, 25 August 2020 at 15:02:14 UTC, FreeSlave wrote:
On Tuesday, 25 August 2020 at 03:41:06 UTC, Jon Degenhardt 
wrote:
What's the best way to get the element type of an array at 
compile time?


Something like std.range.ElementType except that works on any 
array type. There is std.traits.ForeachType, but it wasn't 
clear if that was the right thing.


--Jon


Why not just use typeof(a[0])

It does not matter if the array is empty or not. typeof does not 
actually evaluate its expression; it just determines the type.


Wow, yet another way that should have been obvious! Thanks!
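For reference, a quick check that the typeof approach works even on an empty array, since typeof never evaluates its operand:

```d
unittest
{
    int[] dynamic;      // empty: typeof never evaluates dynamic[0]
    double[4] fixed;    // works for static arrays too

    static assert(is(typeof(dynamic[0]) == int));
    static assert(is(typeof(fixed[0]) == double));
}
```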

--Jon


Re: How to get the element type of an array?

2020-08-25 Thread Jon Degenhardt via Digitalmars-d-learn
On Tuesday, 25 August 2020 at 12:50:35 UTC, Steven Schveighoffer 
wrote:
The situation is still confusing though. If only 
'std.range.ElementType' is imported, a static array does not 
have a 'front' member, but ElementType still gets the correct 
type. (This is where the documentation says it'll return void.)


You are maybe thinking of how C works? D imports are different, 
the code is defined the same no matter how it is imported. 
*your* module cannot see std.range.primitives.front, but the 
range module itself can see that UFCS function.


This is a good characteristic. But the reason it surprised me was 
that I expected to be able to manually expand the ElementType (or 
ElementEncodingType) template and see the results of the 
expressions it uses.


    template ElementType(R)
    {
        static if (is(typeof(R.init.front.init) T))
            alias ElementType = T;
        else
            alias ElementType = void;
    }

So, yes, I was expecting this to behave like an inline code 
expansion.


Yesterday I was doing that for 'hasSlicing', which has a more 
complicated set of tests. I wanted to see exactly which 
expression in 'hasSlicing' was causing it to return false for a 
struct I wrote. (Turned out to be a test for 'length'.)


I'll have to be more careful about this.


Re: How to get the element type of an array?

2020-08-25 Thread Jon Degenhardt via Digitalmars-d-learn

On Tuesday, 25 August 2020 at 05:02:46 UTC, Basile B. wrote:
On Tuesday, 25 August 2020 at 03:41:06 UTC, Jon Degenhardt 
wrote:
What's the best way to get the element type of an array at 
compile time?


Something like std.range.ElementType except that works on any 
array type. There is std.traits.ForeachType, but it wasn't 
clear if that was the right thing.


--Jon


I'm curious to know what are the array types that were not 
accepted by ElementType ( or ElementEncodingType ) ?


Interesting. I needed to test static arrays. In fact, 
'ElementType' does work with static arrays, which is likely what 
you expected.


I assumed ElementType would not work, because static arrays don't 
satisfy 'isInputRange', and the documentation for ElementType 
says:


The element type is determined as the type yielded by r.front 
for an object r of type R. [...] If R doesn't have front, 
ElementType!R is void.


But, if std.range is imported, a static array does indeed get a 
'front' member. It doesn't satisfy isInputRange, but it does have 
a 'front' element.


The situation is still confusing though. If only 
'std.range.ElementType' is imported, a static array does not have 
a 'front' member, but ElementType still gets the correct type. 
(This is where the documentation says it'll return void.)


--- Import std.range ---
@safe unittest
{
import std.range;

ubyte[10] staticArray;
ubyte[] dynamicArray = new ubyte[](10);

static assert(is(ElementType!(typeof(staticArray)) == ubyte));
    static assert(is(ElementType!(typeof(dynamicArray)) == ubyte));


// front is available
static assert(__traits(compiles, staticArray.front));
static assert(__traits(compiles, dynamicArray.front));

static assert(is(typeof(staticArray.front) == ubyte));
static assert(is(typeof(dynamicArray.front) == ubyte));
}

--- Import std.range.ElementType ---
@safe unittest
{
import std.range : ElementType;

ubyte[10] staticArray;
ubyte[] dynamicArray = new ubyte[](10);

static assert(is(ElementType!(typeof(staticArray)) == ubyte));
    static assert(is(ElementType!(typeof(dynamicArray)) == ubyte));


// front is not available
static assert(!__traits(compiles, staticArray.front));
static assert(!__traits(compiles, dynamicArray.front));

static assert(!is(typeof(staticArray.front) == ubyte));
static assert(!is(typeof(dynamicArray.front) == ubyte));
}

This suggests the documentation for ElementType is not quite correct.



Re: How to get the element type of an array?

2020-08-24 Thread Jon Degenhardt via Digitalmars-d-learn

On Tuesday, 25 August 2020 at 04:36:56 UTC, H. S. Teoh wrote:

[...]


Harry Gillanders, H.S. Teoh,

Thank you both for the quick replies. Both methods address my 
needs. Very much appreciated, I was having trouble figuring this 
one out.


--Jon



How to get the element type of an array?

2020-08-24 Thread Jon Degenhardt via Digitalmars-d-learn
What's the best way to get the element type of an array at 
compile time?


Something like std.range.ElementType except that works on any 
array type. There is std.traits.ForeachType, but it wasn't clear 
if that was the right thing.


--Jon


Re: Github Actions now support D out of the box!

2020-08-21 Thread Jon Degenhardt via Digitalmars-d-announce

On Friday, 21 August 2020 at 02:03:40 UTC, Mathias LANG wrote:

[...]


Thanks for the effort on this, I'll definitely be checking it out!

--Jon



Re: getopt Basic usage

2020-08-15 Thread Jon Degenhardt via Digitalmars-d-learn

On Saturday, 15 August 2020 at 04:09:19 UTC, James Gray wrote:
I am trying to use getopt and would not like the program to 
throw an unhandled exception when parsing command line options. 
Is the following, adapted from the first example in the getopt 
documentation, a reasonable approach?


I use the approach you showed, except for writing errors to 
stderr and returning an exit status. This has worked fine. An 
example: 
https://github.com/eBay/tsv-utils/blob/master/number-lines/src/tsv_utils/number-lines.d#L48
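The pattern (catch getopt errors, write to stderr, return an exit status) in minimal form; a sketch, not the linked tool's code:

```d
import std.getopt;
import std.stdio : stderr, writeln;

int main(string[] args)
{
    bool verbose;
    try
    {
        auto result = getopt(args, "verbose|v", &verbose);
        if (result.helpWanted)
        {
            defaultGetoptPrinter("Usage: prog [options]", result.options);
            return 0;
        }
    }
    catch (Exception e)
    {
        // GetOptException derives from Exception; report and exit
        // non-zero instead of letting it escape as an unhandled throw.
        stderr.writeln("Error: ", e.msg);
        return 1;
    }

    writeln("verbose: ", verbose);
    return 0;
}
```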


Re: Reading from stdin significantly slower than reading file directly?

2020-08-13 Thread Jon Degenhardt via Digitalmars-d-learn
On Thursday, 13 August 2020 at 14:41:02 UTC, Steven Schveighoffer 
wrote:
But for sure, reading from stdin doesn't do anything different 
than reading from a file if you are using the File struct.


A more appropriate test might be using the shell to feed the 
file into the D program:


dprogram < FILE

Which means the same code runs for both tests.


Indeed, using the 'prog < file' approach rather than 'cat file | 
prog' removes any distinction for 'tsv-select'. ('tsv-select' 
uses File.rawRead rather than File.byLine.)
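Reading via File.rawRead takes the same code path whether the input is stdin or a named file; a minimal sketch (not tsv-select's actual buffering code):

```d
import std.stdio;

void main(string[] args)
{
    // Same code path for 'prog < file' and 'prog file', matching
    // the comparison above.
    auto input = (args.length > 1) ? File(args[1]) : stdin;

    ubyte[] buffer = new ubyte[](128 * 1024);
    size_t newlines = 0;

    while (true)
    {
        // rawRead returns the slice of the buffer actually filled.
        auto chunk = input.rawRead(buffer);
        if (chunk.length == 0) break;
        foreach (b; chunk) if (b == '\n') ++newlines;
    }
    writeln(newlines);
}
```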




Re: Reading from stdin significantly slower than reading file directly?

2020-08-13 Thread Jon Degenhardt via Digitalmars-d-learn

On Wednesday, 12 August 2020 at 22:44:44 UTC, methonash wrote:

Hi,

Relative beginner to D-lang here, and I'm very confused by the 
apparent performance disparity I've noticed between programs 
that do the following:


1) cat some-large-file | D-program-reading-stdin-byLine()

2) D-program-directly-reading-file-byLine() using File() struct

The D-lang difference I've noticed from options (1) and (2) is 
somewhere in the range of 80% wall time taken (7.5s vs 4.1s), 
which seems pretty extreme.


I don't know enough details of the implementation to really 
answer the question, and I expect it's a bit complicated.


However, it's an interesting question, and I have relevant 
programs and data files, so I tried to get some actuals.


The tests I ran don't directly answer the question posed, but may 
be a useful proxy. I used Unix 'cut' (latest GNU version) and 
'tsv-select' from the tsv-utils package 
(https://github.com/eBay/tsv-utils). 'tsv-select' is written in 
D, and works like 'cut'. 'tsv-select' reads from stdin or a file 
via a 'File' struct. It's not using the built-in 'byLine' member 
though; it uses a version of 'byLine' that includes some 
additional buffering. Both stdin and a file system file are read 
this way.


I used a file from the google ngram collection 
(http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) and the file TREE_GRM_ESTN.csv from https://apps.fs.usda.gov/fia/datamart/CSV/datamart_csv.html, converted to a tsv file.


The ngram file is a narrow file (21 bytes/line, 4 columns), the 
TREE file is wider (206 bytes/line, 49 columns). In both cases I 
cut the 2nd and 3rd columns. This tends to focus processing on 
input rather than processing and output. I also timed 'wc -l' for 
another data point.


I ran the benchmarks 5 times each way and recorded the median 
time below. Machine used is a MacMini (so Mac OS) with 16 GB RAM 
and SSD drives. The numbers are very consistent for this test on 
this machine. Differences in the reported times are real deltas, 
not system noise. The commands timed were:


* bash -c 'tsv-select -f 2,3 FILE > /dev/null'
* bash -c 'cat FILE | tsv-select -f 2,3 > /dev/null'
* bash -c 'gcut -f 2,3 FILE > /dev/null'
* bash -c 'cat FILE | gcut -f 2,3 > /dev/null'
* bash -c 'gwc -l FILE > /dev/null'
* bash -c 'cat FILE | gwc -l > /dev/null'

Note that 'gwc' and 'gcut' are the GNU versions of 'wc' and 'cut' 
installed by Homebrew.


Google ngram file (the 's' unigram file):

Test                          Elapsed  System   User
----                          -------  ------  -----
tsv-select -f 2,3 FILE          10.28    0.42   9.85
cat FILE | tsv-select -f 2,3    11.10    1.45  10.23
cut -f 2,3 FILE                 14.64    0.60  14.03
cat FILE | cut -f 2,3           14.36    1.03  14.19
wc -l FILE                       1.32    0.39   0.93
cat FILE | wc -l                 1.18    0.96   1.04


The TREE file:

Test                          Elapsed  System   User
----                          -------  ------  -----
tsv-select -f 2,3 FILE           3.77    0.95   2.81
cat FILE | tsv-select -f 2,3     4.54    2.65   3.28
cut -f 2,3 FILE                 17.78    1.53  16.24
cat FILE | cut -f 2,3           16.77    2.64  16.36
wc -l FILE                       1.38    0.91   0.46
cat FILE | wc -l                 2.02    2.63   0.77


What this shows is that 'tsv-select' (the D program) was faster 
when reading from a file than when reading from standard input. 
It doesn't indicate why, or whether the delta is due to code in 
the D standard library or code in 'tsv-select'.


Interestingly, 'cut' showed the opposite behavior. It was faster 
when reading from standard input than when reading from the file. 
For 'wc', which method was faster was dependent on line length.


Again, I caution against reading too much into this regarding 
performance of reading from standard input vs a disk file. Much 
more definitive tests can be done. However, it is an interesting 
comparison.


Also, the D program is still fast in both cases.

--Jon


Re: tsv-utils 2.0 release: Named field support

2020-07-29 Thread Jon Degenhardt via Digitalmars-d-announce

On Tuesday, 28 July 2020 at 15:57:57 UTC, bachmeier wrote:
Thanks for your work. I've recommended tsv-utils to some 
students for their data analysis. It's a nice substitute for a 
database depending on what you're doing. It really helps that 
you can store your "database" in a repo like any other text 
file. I'm going to be checking out the new version soon.


Thanks for the support and for checking out the tools! Much 
appreciated.




Re: tsv-utils 2.0 release: Named field support

2020-07-27 Thread Jon Degenhardt via Digitalmars-d-announce

On Monday, 27 July 2020 at 14:32:27 UTC, aberba wrote:

On Sunday, 26 July 2020 at 20:28:56 UTC, Jon Degenhardt wrote:
I'm happy to announce a new major release of eBay's TSV 
Utilities. The 2.0 release supports named field selection in 
all of the tools, a significant usability enhancement.


So I didn't check it out until today, and I'm really impressed 
by the documentation, presentation and just about everything.


Thanks for the kind words, and for taking the time to check out 
the toolkit. Both are very much appreciated!


tsv-utils 2.0 release: Named field support

2020-07-26 Thread Jon Degenhardt via Digitalmars-d-announce

Hi all,

I'm happy to announce a new major release of eBay's TSV 
Utilities. The 2.0 release supports named field selection in all 
of the tools, a significant usability enhancement.


For those not familiar, tsv-utils is a set of command line tools 
for manipulating tabular data files of the type commonly found in 
machine learning and data mining environments. Filtering, 
statistics, sampling, joins, etc. The tools are patterned after 
traditional Unix command line tools like 'cut', 'grep', 'sort', 
etc., and are intended to work with these tools. Each tool is a 
standalone executable. Most people will only care about a subset 
of the tools. It is not necessary to learn the entire toolkit to 
get value from the tools.


The tools are all written in D and are the fastest tools of their 
type available (benchmarks are on the GitHub repository).


Previous versions of the tools referenced fields by field number, 
same as traditional Unix tools like 'cut'. In version 2.0, 
tsv-utils tools take fields either by field number or by field 
name, for files with header lines. A few examples using 
'tsv-select', a tool similar to 'cut' that also supports field 
reordering and dropping fields:


$ # Field numbers: Output fields 2 and 1, in that order.
$ tsv-select -f 2,1 data.tsv

$ # Field names: Output the 'Name' and 'RecordNum' fields.
$ tsv-select -H -f Name,RecordNum data.tsv

$ # Drop the 'Color' field, keep everything else.
$ tsv-select -H --exclude Color file.tsv

$ # Drop all the fields ending in '_time'
$ tsv-select -H -e '*_time' data.tsv

More information is available on the tsv-utils GitHub repository, 
including documentation and pre-built binaries: 
https://github.com/eBay/tsv-utils


--Jon


Re: getopt: How does arraySep work?

2020-07-16 Thread Jon Degenhardt via Digitalmars-d-learn
On Thursday, 16 July 2020 at 17:40:25 UTC, Steven Schveighoffer 
wrote:

On 7/16/20 1:13 PM, Andre Pany wrote:
On Thursday, 16 July 2020 at 05:03:36 UTC, Jon Degenhardt 
wrote:

On Wednesday, 15 July 2020 at 07:12:35 UTC, Andre Pany wrote:

[...]


An enhancement is likely to hit some corner-cases involving 
list termination requiring choices that are not fully 
generic. Any time a legal list value looks like a legal 
option. Perhaps the most important case is single digit 
numeric options like '-1', '-2'. These are legal short form 
options, and there are programs that use them. They are also 
somewhat common numeric values to include in command lines 
inputs.


[...]


My naive implementation would be that any dash would stop the 
list of multiple values. If you want to have a value 
containing a space or a dash, you enclose it with double 
quotes in the terminal.


Enclose with double quotes in the terminal does nothing:

myapp --modelicalibs "file-a.mo" "file-b.mo"

will give you EXACTLY the same string[] args as:

myapp --modelicalibs file-a.mo file-b.mo

I think Jon's point is that it's difficult to distinguish where 
an array list ends if you get the parameters as separate items.


Like:

myapp --numbers 1 2 3 -5 -6

Is that numbers=> [1, 2, 3, -5, -6]

or is it numbers=> [1, 2, 3], 5 => true, 6 => true

This is probably why the code doesn't support that.

-Steve


Yes, this is what I was getting at. Thanks for the clarification.

Also, it's not always immediately obvious what part of the 
argument splitting is being done by the shell, and what is being 
done by the program/getopt. Taking inspiration from the recent 
one-liners, here's way to see how the program gets the args from 
the shell for different command lines:


$ echo 'import std.stdio; void main(string[] args) { args[1 .. 
$].writeln; }' | dmd -run - --numbers 1,2,3,-5,-6

["--numbers", "1,2,3,-5,-6"]

$ echo 'import std.stdio; void main(string[] args) { args[1 .. 
$].writeln; }' | dmd -run - --numbers 1 2 3 -5 -6

["--numbers", "1", "2", "3", "-5", "-6"]

$ echo 'import std.stdio; void main(string[] args) { args[1 .. 
$].writeln; }' | dmd -run - --numbers "1" "2" "3" "-5" "-6"

["--numbers", "1", "2", "3", "-5", "-6"]

$ echo 'import std.stdio; void main(string[] args) { args[1 .. 
$].writeln; }' | dmd -run - --numbers '1 2 3 -5 -6'

["--numbers", "1 2 3 -5 -6"]

The first case is what getopt supports now - All the values in a 
single string with a separator that getopt splits on. The 2nd and 
3rd are identical from the program's perspective (Steve's point), 
but they've already been split, so getopt would need a different 
approach. And requires dealing with ambiguity. The fourth form 
eliminates the ambiguity, but puts the burden on the user to use 
quotes.
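For reference, the single-string form that getopt supports today is handled by setting std.getopt's arraySep variable, which tells getopt to split one option value on a separator. A minimal sketch:

```d
import std.getopt;
import std.stdio;

void main(string[] args)
{
    int[] numbers;
    arraySep = ",";  // getopt splits a single option value on this separator
    getopt(args, "numbers", &numbers);
    writeln(numbers);
}
```

Run as `./app --numbers 1,2,3,-5,-6` this prints `[1, 2, 3, -5, -6]`. Note that the negative values are unambiguous in this form: since they arrive inside a single argument string, '-5' never appears to getopt as a standalone option-like argument.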


Re: getopt: How does arraySep work?

2020-07-15 Thread Jon Degenhardt via Digitalmars-d-learn

On Wednesday, 15 July 2020 at 07:12:35 UTC, Andre Pany wrote:

On Tuesday, 14 July 2020 at 15:48:59 UTC, Andre Pany wrote:
On Tuesday, 14 July 2020 at 14:33:47 UTC, Steven Schveighoffer 
wrote:

On 7/14/20 10:22 AM, Steven Schveighoffer wrote:
The documentation needs updating, it should say "parameters 
are added sequentially" or something like that, instead of 
"separation by whitespace".


https://github.com/dlang/phobos/pull/7557

-Steve


Thanks for the answer and the pr. Unfortunately my goal here 
is to simulate a partner tool written  in C/C++ which supports 
this behavior. I will also create an enhancement issue for 
supporting this behavior.


Kind regards
Anste


Enhancement issue:
https://issues.dlang.org/show_bug.cgi?id=21045

Kind regards
André


An enhancement is likely to hit some corner-cases involving list 
termination requiring choices that are not fully generic. Any 
time a legal list value looks like a legal option. Perhaps the 
most important case is single digit numeric options like '-1', 
'-2'. These are legal short form options, and there are programs 
that use them. They are also somewhat common numeric values to 
include in command lines inputs.


I ran into a couple cases like this with a getopt cover I wrote. 
The cover supports runtime processing of command arguments in the 
order entered on the command line rather than the compile-time 
getopt() call order. Since it was only for my stuff, not Phobos, 
it was an easy choice: Disallow single digit short options. But a 
Phobos enhancement might make other choices.


IIRC, a characteristic of the current getopt implementation is 
that it does not have run-time knowledge of all the valid 
options, so the set of ambiguous entries is larger than just the 
limited set of options specified in the program. Essentially, 
anything that looks syntactically like an option.


Doesn't mean an enhancement can't be built, just that there might 
some constraints to be aware of.


--Jon




Re: Looking for a Code Review of a Bioinformatics POC

2020-06-12 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 12 June 2020 at 06:20:59 UTC, H. S. Teoh wrote:
I glanced over the implementation of byLine.  It appears to be 
the unhappy compromise of trying to be 100% correct, cover all 
possible UTF encodings, and all possible types of input streams 
(on-disk file vs. interactive console).  It does UTF decoding 
and resizing of arrays, and a lot of other frilly little 
squirrelly things.  In fact I'm dismayed at how hairy it is, 
considering the conceptual simplicity of the task!


Given this, it will definitely be much faster to load in large 
chunks of the file at a time into a buffer, and scanning 
in-memory for linebreaks. I wouldn't bother with decoding at 
all; I'd just precompute the byte sequence of the linebreaks 
for whatever encoding the file is expected to be in, and just 
scan for that byte pattern and return slices to the data.


This is basically what bufferedByLine in tsv-utils does. See: 
https://github.com/eBay/tsv-utils/blob/master/common/src/tsv_utils/common/utils.d#L793.


tsv-utils has the advantage of only needing to support utf-8 
files with Unix newlines, so the code is simpler. (Windows 
newlines are detected, this occurs separately from 
bufferedByLine.) But as you describe, support for a wider variety 
of input cases could be done without sacrificing basic 
performance. iopipe provides much more generic support, and it is 
quite fast.
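The core idea (read fixed-size chunks, scan in-memory for newlines, slice complete lines out directly, and carry a partial line across chunk boundaries) can be sketched as follows. This is an illustrative simplification, not tsv-utils' actual bufferedByLine, and it assumes utf-8 input with Unix newlines:

```d
import std.array : appender;
import std.stdio;

// Hand each complete line (without its '\n') to `sink`. `Chunks` can be
// File.byChunk or any range of ubyte[] buffers.
void chunkedByLine(Chunks)(Chunks chunks, void delegate(const(char)[] line) sink)
{
    auto partial = appender!(char[]);  // carries a partial line between chunks
    foreach (chunk; chunks)
    {
        auto text = cast(const(char)[]) chunk;
        size_t start = 0;
        foreach (i, char c; text)  // typed char: iterate code units, no decoding
        {
            if (c == '\n')
            {
                if (partial.data.length > 0)
                {
                    partial.put(text[start .. i]);  // finish the carried line
                    sink(partial.data);
                    partial.clear();
                }
                else
                {
                    sink(text[start .. i]);  // line fully inside this chunk
                }
                start = i + 1;
            }
        }
        partial.put(text[start .. $]);  // hold the incomplete trailing line
    }
    if (partial.data.length > 0) sink(partial.data);  // final unterminated line
}

void main()
{
    // Real use: chunkedByLine(stdin.byChunk(64 * 1024), (line) { ... });
    auto chunks = [cast(ubyte[]) "ab\ncd".dup, cast(ubyte[]) "ef\n".dup];
    chunkedByLine(chunks, (const(char)[] line) { writeln(line); });
    // prints "ab" then "cdef"
}
```

Lines that sit entirely inside one chunk are returned as slices with no copying; only lines straddling a chunk boundary are assembled in the appender.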


Having said all of that, though: usually in non-trivial 
programs reading input is the least of your worries, so this 
kind of micro-optimization is probably unwarranted except for 
very niche cases and for micro-benchmarks and other such toy 
programs where the cost of I/O constitutes a significant chunk 
of running times.  But knowing what byLine does under the hood 
is definitely interesting information for me to keep in mind, 
the next time I write an input-heavy program.


tsv-utils tools saw performance gains of 10-40% by moving from 
File.byLine to bufferedByLine, depending on tool and type of file 
(narrow or wide). Gains of 5-20% were obtained by switching from 
File.write to BufferedOutputRange, with some special cases 
improving by 50%. tsv-utils tools aren't micro-benchmarks, but 
they are not typical apps either. Most of the tools go into a 
tight loop of some kind, running a transformation on the input 
and writing to the output. Performance is a real benefit to these 
tools, as they get run on reasonably large data sets.




Re: Looking for a Code Review of a Bioinformatics POC

2020-06-11 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 12 June 2020 at 00:58:34 UTC, duck_tape wrote:

On Thursday, 11 June 2020 at 23:45:31 UTC, H. S. Teoh wrote:


Hmm, looks like it's not so much input that's slow, but 
*output*. In fact, it looks pretty bad, taking almost as much 
time as overlap() does in total!


[snip...]


I'll play with that a bit tomorrow! I saw a nice implementation 
on eBay's tsvutils that I may need to look closer at.


Someone else suggested that stdout flushes per line by default. 
I dug around the stdlib but couldn't confirm that. I also played 
around with setvbuf but it didn't seem to change anything.


Have you run into that before / know if stdout is flushing 
every newline? I'm not above opening '/dev/stdout' as a file if 
that writes faster.


I put some comparative benchmarks in 
https://github.com/jondegenhardt/dcat-perf. It  compares input 
and output using standard Phobos facilities (File.byLine, 
File.write), iopipe (https://github.com/schveiguy/iopipe), and 
the tsv-utils buffered input and buffered output facilities.


I haven't spent much time on results presentation, I know it's 
not that easy to read and interpret the results. Brief summary - 
On files with short lines buffering will result in dramatic 
throughput improvements over the standard phobos facilities. This 
is true for both input and output, though likely for different 
reasons. For input iopipe is the fastest available. tsv-utils 
buffered facilities are materially faster than phobos for both 
input and output, but not as fast as iopipe for input. Combining 
iopipe for input with tsv-utils BufferOutputRange for output 
works pretty well.


For files with long lines both iopipe and tsv-utils 
BufferedByLine are materially faster than Phobos File.byLine when 
reading. For writing there wasn't much difference from Phobos 
File.write.


A note on File.byLine - I've had many opportunities to compare 
Phobos File.byLine to facilities in other programming languages, 
and it is not bad at all. But it is beatable.


About Memory Mapped Files - The benchmarks don't include a 
comparison against mmfile. They would certainly make sense as a 
comparison point.


--Jon


Re: On the D Blog: Lomuto's Comeback

2020-05-17 Thread Jon Degenhardt via Digitalmars-d-announce

On Thursday, 14 May 2020 at 13:26:23 UTC, Mike Parker wrote:
After reading a paper that grabbed his curiosity and wouldn't 
let go, Andrei set out to determine if Lomuto partitioning 
should still be considered inferior to Hoare for quicksort on 
modern hardware. This blog post details his results.


Blog:
https://dlang.org/blog/2020/05/14/lomutos-comeback/

Reddit:
https://www.reddit.com/r/programming/comments/gjm6yp/lomutos_comeback_quicksort_partitioning/

HN:
https://news.ycombinator.com/item?id=23179160


Got posted again to Hacker News earlier today. Currently at 
position 5.


Re: Idiomatic way to write a range that tracks how much it consumes

2020-04-27 Thread Jon Degenhardt via Digitalmars-d-learn

On Monday, 27 April 2020 at 05:06:21 UTC, anon wrote:
To implement your option A you could simply use 
std.range.enumerate.


Would something like this work?

import std.algorithm.iteration : map;
import std.algorithm.searching : until;
import std.range : tee;

size_t bytesConsumed;
auto result = input.map!(a => a.yourTransformation )
   .until!(stringTerminator)
   .tee!(a => bytesConsumed++);
// bytesConsumed is automatically updated as result is consumed


That's interesting. It wouldn't work quite like that, but 
something similar would. Still, I don't think it quite achieves 
what I want.


One thing that's missing is that the initial input is simply a 
string, there's nothing to map over at that point. There is 
however a transformation step that transforms the string into a 
sequence of slices. Then there's a transformation on those 
slices. That would be a step prior to the 'map' step. Also, in my 
case 'map' cannot be used, because each slice may produce 
multiple outputs.


The specifics are minor details, not really so important. The 
implementation can take a form along the lines described. 
However, structuring like this exposes the details of these steps 
to all callers. That is, all callers would have to write the code 
above.


My goal is encapsulate the steps into a single range all callers 
can use. That is, encapsulate something like the steps you have 
above in a standalone range that takes the input string as an 
argument, produces all the output elements, and preserves the 
bytesConsumed in a way the caller can access it.


Re: Idiomatic way to write a range that tracks how much it consumes

2020-04-26 Thread Jon Degenhardt via Digitalmars-d-learn
On Monday, 27 April 2020 at 04:51:54 UTC, Steven Schveighoffer 
wrote:

On 4/26/20 11:38 PM, Jon Degenhardt wrote:

Is there a better way to write this?


I had exactly the same problems. I created this to solve the 
problem, I've barely tested it, but I plan to use it with all 
my parsing utilities on iopipe:


https://code.dlang.org/packages/bufref
https://github.com/schveiguy/bufref/blob/master/source/bufref.d


Thanks Steve, I'll definitely take a look at this.  --Jon



Re: Idiomatic way to write a range that tracks how much it consumes

2020-04-26 Thread Jon Degenhardt via Digitalmars-d-learn

On Monday, 27 April 2020 at 04:41:58 UTC, drug wrote:

On 27.04.2020 06:38, Jon Degenhardt wrote:


Is there a better way to write this?

--Jon


I don't know a better way; I think you've listed all the 
possible ways - get the value using either `front` or a special 
range member. I prefer the second variant; I don't think it is 
less consistent with range paradigms. Considering that you need 
the number of consumed bytes only when the range is empty, the 
second way is more efficient.


Thanks. Of the two, I like the second better as well.


Idiomatic way to write a range that tracks how much it consumes

2020-04-26 Thread Jon Degenhardt via Digitalmars-d-learn
I have a string that contains a sequence of elements, then a 
terminator character, followed by a different sequence of 
elements (of a different type).


I want to create an input range that traverses the initial 
sequence. This is easy enough. But after the initial sequence has 
been traversed, the caller will need to know where the next 
sequence starts. That is, the caller needs to know the index in 
the input string where the initial sequence ends and the next 
sequence begins.


The values returned by the range are a transformation of the 
input, so the values by themselves are insufficient for the 
caller to determined how much of the string has been consumed. 
And, the caller cannot simply search for the terminator character.


Tracking the number of bytes consumed is easy enough. I'd like 
to do it in a way that is consistent with D's normal range paradigm.


Two candidate approaches:
a) Instead of having the range return the individual values, it 
could return a tuple containing the value and the number of bytes 
consumed.


b) Give the input range an extra member function which returns 
the number of bytes consumed. The caller could call this after 
'empty()' returns true to find the amount of data consumed.


Both will work, but I'm not especially satisfied with either. 
Approach (a) seems more consistent with the typical range 
paradigms, but also more of a hassle for callers.
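As an illustration of approach (b), a toy standalone range might look like the following. All names are hypothetical, and the per-element transformation the real use case needs is omitted; it yields comma-separated fields up to a ';' terminator and exposes the consumed byte count as a member:

```d
import std.stdio;

struct FieldsUntilSemicolon
{
    private string input;
    private size_t pos;
    private string current;
    private bool done;

    this(string s) { input = s; advance(); }

    @property bool empty() const { return done; }
    @property string front() const { return current; }
    void popFront() { advance(); }

    // Meaningful once empty() is true: the index just past the terminator.
    @property size_t bytesConsumed() const { return pos; }

    private void advance()
    {
        if (pos >= input.length || input[pos] == ';')
        {
            if (pos < input.length) ++pos;  // consume the ';'
            done = true;
            return;
        }
        immutable start = pos;
        while (pos < input.length && input[pos] != ',' && input[pos] != ';')
            ++pos;
        current = input[start .. pos];
        if (pos < input.length && input[pos] == ',') ++pos;  // skip the ','
    }
}

void main()
{
    enum text = "ab,cd;rest of the input";
    auto r = FieldsUntilSemicolon(text);
    while (!r.empty) { writeln(r.front); r.popFront(); }  // ab, cd
    writeln(text[r.bytesConsumed .. $]);  // the caller resumes here: "rest of the input"
}
```

Note the explicit while loop: foreach over a struct range iterates a copy, so reading bytesConsumed afterwards would report the position of the unconsumed original.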


Is there a better way to write this?

--Jon


Re: Integration tests

2020-04-17 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 17 April 2020 at 16:56:57 UTC, Russel Winder wrote:

Hi,

Thinking of trying to do the next project in D rather than 
Rust, but…


Rust has built in unit testing on a module basis. D has this so 
no problem.


Rust allows for integration tests in the tests directory of a 
project. These are automatically build and run along with all 
unit tests as part of "cargo test".


Does D have any integrated support for integration tests in the 
way Rust does?


Automated testing is important - perhaps you can describe further 
what's needed? I haven't worked with Rust test frameworks, but I 
took a look at the description of the integration tests and unit 
tests. It wasn't immediately obvious what can be done with the 
Rust integration test framework that cannot be done with D's 
unittest framework.


An important concept described was testing a module as an 
external caller. That would seem to be very doable using D's 
unittest framework. For example, one could create a set of tests 
against Phobos, put them in a separate location (e.g. a separate 
file), and arrange to have the unittests run as part of a CI 
process run along with a build.
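For example, an external-caller test might be a separate file that exercises only the public API of the modules under test (here Phobos, standing in for any library), analogous to Rust's tests/ directory. A minimal sketch; the file name and build command are illustrative:

```d
// integration_tests.d - kept outside the modules it tests, so only their
// public API is visible. Run with: dmd -unittest -main -run integration_tests.d
import std.algorithm.sorting : sort;
import std.array : array;
import std.range : iota, retro;

unittest
{
    // Exercise the library purely through public functions.
    auto data = iota(1, 6).retro.array;  // [5, 4, 3, 2, 1]
    sort(data);
    assert(data == [1, 2, 3, 4, 5]);
}
```

A CI job can compile and run such files alongside the build, giving the same automatic "build plus tests" flow that cargo test provides.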


My look was very superficial, perhaps you could explain more.


Re: How to correctly import tsv-utilites functions?

2020-04-15 Thread Jon Degenhardt via Digitalmars-d-learn

On Tuesday, 14 April 2020 at 20:25:08 UTC, p.shkadzko wrote:
On Tuesday, 14 April 2020 at 20:05:28 UTC, Steven Schveighoffer 
wrote:

On 4/14/20 3:34 PM, p.shkadzko wrote:

[...]



What about using dependency tsv-utils:common ?

Looks like tsv-utils is a collection of subpackages, and the 
main package just serves as a namespace.


-Steve


Yes, it works! Thank you.


Glad that worked for you. (And thanks Steve!) I have a small app 
with an example of a dub.json file that pulls the tsv-utils 
common dependencies this way: 
https://github.com/jondegenhardt/dcat-perf/blob/master/dub.json
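For reference, the dub.json dependency entry for the subpackage looks something like the fragment below; the package name and version constraint shown are illustrative:

```json
{
    "name": "myapp",
    "dependencies": {
        "tsv-utils:common": "~>2.0.0"
    }
}
```

The "parent:subpackage" form is what pulls in just the tsv-utils common code rather than the whole collection.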


--Jon


Re: Our HOPL IV submission has been accepted!

2020-02-28 Thread Jon Degenhardt via Digitalmars-d-announce
On Saturday, 29 February 2020 at 01:00:40 UTC, Andrei 
Alexandrescu wrote:
Walter, Mike, and I are happy to announce that our paper 
submission "Origins of the D Programming Language" has been 
accepted at the HOPL IV (History of Programming Languages) 
conference.


https://hopl4.sigplan.org/track/hopl-4-papers

Getting a HOPL paper in is quite difficult, and an important 
milestone for the D language. We'd like to thank the D 
community which was instrumental in putting the D language on 
the map.


The HOPL IV conference will take place in London right before 
DConf. With regard to travel, right now Covid-19 fears are on 
everybody's mind; however, we are hopeful that between now and 
then the situation will improve.


Congrats! Indeed a meaningful accomplishment.


New graphs for tsv-utils performance benchmarks

2020-01-29 Thread Jon Degenhardt via Digitalmars-d-announce
A small thing - Many people who have seen the performance 
benchmarks for eBay's TSV Utilities find the text table format 
I've used in the past hard to read. Me too. So, I finally 
generated more traditional graphical representations for the 2018 
benchmark results.


The graphs are here: 
https://github.com/eBay/tsv-utils/blob/master/docs/Performance.md#2018-benchmark-summary


There are no new benchmarks, just new visualizations of the 
results.


For folks who not familiar with these benchmarks - This is part 
of performance studies done by comparing eBay's TSV Utilities 
with a number of command line tools providing similar 
functionality (e.g. awk). The results shown were presented at 
DConf 2018.


* Details of the performance study - 
https://github.com/eBay/tsv-utils/blob/master/docs/Performance.md
* DConf 2018 talk slides - 
https://github.com/eBay/tsv-utils/blob/master/docs/dconf2018.pdf




Re: Unexpected result with std.conv.to

2019-11-14 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 15 November 2019 at 03:51:04 UTC, Joel wrote:
I made a feature that converts, say, [9:59am] -> [10:00am] to 1 
minute, but found '9'.to!int = 57 (not 9).


Doesn't seem right... I'm guessing that's standard though, same 
with ldc.


Use a string or char[] array. e.g. writeln("9".to!int) => 9.

With a single 'char', what is being produced is the ASCII value 
of the character.


Re: csvReader & specifying separator problems...

2019-11-14 Thread Jon Degenhardt via Digitalmars-d-learn
On Thursday, 14 November 2019 at 12:25:30 UTC, Robert M. Münch 
wrote:
Just trying a very simple thing and it's pretty hard: "Read a 
CSV file (raw_data) that has a ; separator so that I can 
iterate over the lines and access the fields."


csv_data = raw_data.byLine.joiner("\n")

From the docs, which I find extremely hard to understand:

auto csvReader(Contents = string, Malformed ErrorLevel = 
Malformed.throwException, Range, Separator = char)(Range input, 
Separator delimiter = ',', Separator quote = '"')


So, let's see if I can decipher this, step-by-step, by trying 
out:


csv_records = csv_data.csvReader();

Would split the CSV data into iterable CSV records using ',' 
char as separator using UFCS syntax. When running this I get:


[...]


Side comment - This code looks like it was taken from the first 
example in the std.csv documentation. To me, the code in the 
std.csv example is doing something that might not be obvious at 
first glance and is potentially confusing.


In particular, 'byLine' is not reading individual CSV records. 
CSV can have embedded newlines, these are identified by CSV 
escape syntax. 'byLine' doesn't know the escape syntax. If there 
are embedded newlines, 'byLine' will read partial records, which 
may not be obvious at first glance. The .joiner("\n") step puts 
the newline back, stitching fields and records back together 
again in the process.


The effect is to create an input range of characters representing 
the entire file, using 'byLine' to do buffered reads. This input 
range is passed to CSVReader.


This could also be done using 'byChunk' and 'joiner' (with no 
separator). This would use a fixed size buffer, no searching for 
newlines while reading, so it should be faster.


An example:

---- csv_by_chunk.d ----
import std.algorithm;
import std.csv;
import std.conv;
import std.stdio;
import std.typecons;
import std.utf;

void main()
{
// Small buffer used to show it works. Normally would use a 
larger buffer.

ubyte[16] buffer;
auto stdinBytes = stdin.byChunk(buffer).joiner;
auto stdinDChars = stdinBytes.map!((ubyte b) => cast(char) 
b).byDchar;


writefln("--");
foreach (record; stdinDChars.csvReader!(Tuple!(string, 
string, string)))

{
writefln("Field 0: |%s|", record[0]);
writefln("Field 1: |%s|", record[1]);
writefln("Field 2: |%s|", record[2]);
writefln("--");
}
}

Pass it csv data without embedded newlines:

$ echo $'abc,def,ghi\njkl,mno,pqr' | ./csv_by_chunk
--
Field 0: |abc|
Field 1: |def|
Field 2: |ghi|
--
Field 0: |jkl|
Field 1: |mno|
Field 2: |pqr|
--

Pass it csv data with embedded newlines:

$ echo $'abc,"LINE 1\nLINE 2",ghi\njkl,mno,pqr' | ./csv_by_chunk
--
Field 0: |abc|
Field 1: |LINE 1
LINE 2|
Field 2: |ghi|
--
Field 0: |jkl|
Field 1: |mno|
Field 2: |pqr|
--

An example like this may avoid the confusion about newlines. 
Unfortunately, the need to do the odd looking conversion from 
ubyte to char/dchar is undesirable in a code example. I haven't 
found a cleaner way to write that. If there's a nicer way I'd 
appreciate hearing about it.


--Jon



Re: formatting a float or double in a string with all significant digits kept

2019-10-10 Thread Jon Degenhardt via Digitalmars-d-learn

On Thursday, 10 October 2019 at 17:12:25 UTC, dan wrote:

Thanks also berni44 for the information about the dig attribute,
Jon for the neat packaging into one line using the attribute on 
the type.
Unfortunately, the version of gdc that comes with the version of 
Debian that I am using does not have the dig attribute yet, but 
perhaps I can upgrade, and eventually I think gdc will have it.


Glad these ideas helped. The value of the 'double.dig' property 
is not going to change between compilers/versions/etc. It's 
really a property of IEEE 754 floating point for 64 bit floats. 
(D specified the size of double as 64).  So, if you are using 
double, then it's pretty safe to use 15 until the compiler you're 
using is further along on versions. Declare an enum or const 
variable to give it a name so you can track it down later.


Also, don't get thrown off by the fact that PI is a real, not a 
double. D supports 80 bit floats as real, so constants like PI 
are defined as real. But if you convert PI to a double, it'll 
then have 15 significant digits of precision.


--Jon


Re: formatting a float or double in a string with all significant digits kept

2019-10-09 Thread Jon Degenhardt via Digitalmars-d-learn

On Wednesday, 9 October 2019 at 05:46:12 UTC, berni44 wrote:

On Tuesday, 8 October 2019 at 20:37:03 UTC, dan wrote:
But i would like to be able to do this without knowing the 
expansion of pi, or writing too much code, especially if 
there's some d function like writeAllDigits or something 
similar.


You can use the property .dig to get the number of significant 
digits of a number:


writeln(PI.dig); // => 18

You still need to account for the numbers before the dot. If 
you're happy with scientific notation you can do:


auto t = format("%.*e", PI.dig, PI);
writeln("PI = ",t);


Using the '.dig' property is a really nice idea and looks very 
useful for this. A clarification though - It's the significant 
digits in the data type, not the value. (PI is 18 because it's a 
real, not a double.) So:


writeln(1.0f.dig, ", ", float.dig);  =>  6, 6
writeln(1.0.dig, ", ", double.dig);  => 15, 15
writeln(1.0L.dig, ", ", real.dig);   => 18, 18

Another possibility would be to combine the '.dig' property with 
the "%g" option, similar to the use "%e" shown. For example, 
these lines:


writeln(format("%0.*g", PI.dig, PI));
writeln(format("%0.*g", double.dig, 1.0));
writeln(format("%0.*g", double.dig, 100.0));
writeln(format("%0.*g", double.dig, 1.0001));
writeln(format("%0.*g", double.dig, 0.0001));

produce:

3.14159265358979324
1
100
1.0001
1e-08

Hopefully experimenting with the different formatting options 
available will yield one that works for your use case.


Re: LDC 1.17.0-beta1

2019-08-10 Thread Jon Degenhardt via Digitalmars-d-announce

On Saturday, 10 August 2019 at 15:51:28 UTC, kinke wrote:

Glad to announce the first beta for LDC 1.17:
...
Please help test, and thanks to all contributors!


No changes in my standard performance tests (good). All 
functional tests pass as well.


Re: Help me decide D or C

2019-08-02 Thread Jon Degenhardt via Digitalmars-d-learn

On Wednesday, 31 July 2019 at 18:38:02 UTC, Alexandre wrote:
Should I go for C and then when I become a better programmer 
change to D?

Should I start with D right now?


In my view, the most important thing is the decision you've 
already made - to pick a programming language and learn it in a 
reasonable bit of depth. Which programming language you choose is 
less important. No matter which choice you make you'll have the 
opportunity to learn skills that will transfer to other 
programming languages.


As you can tell from the other responses, the pros and cons of a 
learning a specific language depend quite a bit on what you hope 
to get out of it, and are to a fair extent subjective. But both C 
and D provide meaningful opportunities to gain worthwhile 
experience.


A couple reasons for considering learning D over C are its 
support for functional programming and templates. These were also 
mentioned by a few other people. These are not really "beginner" 
topics, but as one moves past the beginner stage they are really 
quite valuable techniques to start mastering. For both D is the 
far better option, and it's not necessary to use either when 
starting out.


--Jon


Re: rdmd takes 2-3 seconds on a first-run of a simple .d script

2019-05-26 Thread Jon Degenhardt via Digitalmars-d-learn

On Saturday, 25 May 2019 at 22:18:16 UTC, Andre Pany wrote:

On Saturday, 25 May 2019 at 08:32:08 UTC, BoQsc wrote:
I have a simple standard .d script and I'm getting annoyed 
that it takes 2-3 seconds to run and see the results via rdmd.


Also please keep in mind there could be other factors like slow 
disks, anti virus scanners,... which causes a slow down.


I have seen similar behavior that I attribute to virus scan 
software. After compiling a program, the first run takes several 
seconds to run, after that it runs immediately. I'm assuming the 
first run of an unknown binary triggers a scan, though I cannot 
be completely sure.


Try compiling a new binary in D or C++ and see if a similar 
effect is seen.


--Jon



Re: bool (was DConf 2019 AGM Livestream)

2019-05-12 Thread Jon Degenhardt via Digitalmars-d-announce

On Sunday, 12 May 2019 at 17:08:49 UTC, Jonathan M Davis wrote:

... snip ...
Fortunately, in the grand scheme of things, while this issue 
does matter, it's still much smaller than almost all of the 
issues that we have to worry about and consider having DIPs for.


Personally, I'm not at all happy that this DIP was rejected, 
but I think that continued debate on it is a waste of 
everyone's time.


Agreed. I too have never liked numeric values equated to 
true/false, in any programming language. However, it is very 
common. And, relative to the other big ticket items on the table, 
it is of minor importance. Changing the current behavior won't 
materially affect the usability of D or its future. This is a 
case where the best course is to make a decision and move on.


--Jon



Re: eBay's TSV Utilities status update

2019-05-03 Thread Jon Degenhardt via Digitalmars-d-announce

On Friday, 3 May 2019 at 03:54:14 UTC, James Blachly wrote:

On 4/29/19 11:23 AM, Jon Degenhardt wrote:

An update on changes to this tool-set over the last year.

...
Thank you for this, and thanks for your blog post of a couple 
of years ago, which I referred to many times while learning D 
and writing fast(er) CLI tools.


Looking forward to trying Steve's iopipe as well as your 
bufferedByLineReader.


James


Thanks for the kind words James!


Re: Poor regex performance?

2019-04-04 Thread Jon Degenhardt via Digitalmars-d-learn

On Thursday, 4 April 2019 at 10:31:43 UTC, Julian wrote:
On Thursday, 4 April 2019 at 09:57:26 UTC, rikki cattermole 
wrote:

If you need performance use ldc not dmd (assumed).

LLVM's code optimization is many factors better than dmd's.


Thanks! I already had dmd installed from a brief look at D a 
long
time ago, so I missed the details at 
https://dlang.org/download.html


ldc2 -O3 does a lot better, but the result is still 30x slower
without PCRE.


Try:
ldc2 -O3 -release -flto=thin 
-defaultlib=phobos2-ldc-lto,druntime-ldc-lto -enable-inlining


This will improve inlining and optimization across the runtime 
library boundaries. This can help in certain types of code.


Dub: A json/sdl equivalent to --combined command line option?

2019-04-01 Thread Jon Degenhardt via Digitalmars-d-learn
In Dub, is there a way to specify the equivalent of the 
--combined command line argument in the json/sdl package config 
file?


What I'd like to be able to do is create a custom build type such 
that


$ dub build --build=build-xyz

builds in combined mode, without needing to add the --combined on 
the command line. Putting it on the command line as follows did 
what I intended:


   $ dub build --build=build-xyz --combined

--Jon


Re: NEW Milestone: 1500 packages at code.dlang.org

2019-02-07 Thread Jon Degenhardt via Digitalmars-d-announce

On Thursday, 7 February 2019 at 18:02:21 UTC, H. S. Teoh wrote:
On Thu, Feb 07, 2019 at 05:06:09PM +, Seb via 
Digitalmars-d-announce wrote:

On Thursday, 7 February 2019 at 16:40:08 UTC, Anonymouse wrote:
> What was the word on the autotester (or similar) testing 
> popular packages as part of the test suite?

This has been done for more than a year now for the ~50 most 
popular packages: https://buildkite.com/dlang


In my opinion this is one of the main reasons why the last 
releases were so successful (=almost no regressions).


That's awesome. This is the way to go.  Congrats to everyone 
who helped pull this off.



T


Agreed! This is a really nice bit of work that's come out of the 
D ecosystem.


Re: D-lighted, I'm Sure

2019-01-19 Thread Jon Degenhardt via Digitalmars-d-announce

On Friday, 18 January 2019 at 14:29:14 UTC, Mike Parker wrote:
Not long ago, in my retrospective on the D Blog in 2018, I 
invited folks to write about their first impressions of D. Ron 
Tarrant, who you may have seen in the Learn forum, answered the 
call. The result is the latest post on the blog, the first 
guest post of 2019. Thanks, Ron!


As a reminder, I'm still looking for new-user impressions and 
guest posts on any D-related topic. Please contact me if you're 
interested. And don't forget, there's a bounty for guest posts, 
so you can make a bit of extra cash in the process.


The blog:
https://dlang.org/blog/2019/01/18/d-lighted-im-sure/

Reddit:
https://www.reddit.com/r/programming/comments/ahawhz/dlighted_im_sure_the_first_two_months_with_d/


Nicely done. Very enjoyable, thanks for publishing this!

--Jon


Re: My Meeting C++ Keynote video is now available

2019-01-12 Thread Jon Degenhardt via Digitalmars-d-announce
On Saturday, 12 January 2019 at 15:51:03 UTC, Andrei Alexandrescu 
wrote:

https://youtube.com/watch?v=tcyb1lpEHm0

If nothing else please watch the opening story, it's true and 
quite funny :o).


Now as to the talk, as you could imagine, it touches on another 
language as well...



Andrei


Very nice. I especially liked how design by introspection was 
contrasted with other approaches and how the constexpr discussion 
fit into the overall theme.


--Jon


Re: DCD, D-Scanner and DFMT : new year edition

2018-12-31 Thread Jon Degenhardt via Digitalmars-d-announce

On Monday, 31 December 2018 at 07:56:00 UTC, Basile B. wrote:
DCD [1] 0.10.2 comes with bugfixes and small API changes. DFMT 
[2] and D-Scanner [3] with bugfixes too and all of the three 
products are based on d-parse 0.10.z, making life easier and 
the libraries versions more consistent for the D IDE and D IDE 
plugins developers.


[1] https://github.com/dlang-community/DCD/releases/tag/v0.10.2
[2] https://github.com/dlang-community/dfmt/releases/tag/v0.9.0
[3] 
https://github.com/dlang-community/D-Scanner/releases/tag/v0.6.0


Thanks for the ongoing work on DCD et al!


Re: Which Docker to use?

2018-11-11 Thread Jon Degenhardt via Digitalmars-d-learn

On Monday, 22 October 2018 at 18:44:01 UTC, Jacob Carlborg wrote:

On 2018-10-21 20:45, Jon Degenhardt wrote:

The issue that caused me to go to Ubuntu 16.04 had to do with 
uncaught exceptions when using LTO with the gold linker and 
LDC 1.5. Problem occurred with 14.04, but not 16.04. I should 
go back and retest on Ubuntu 14.04 with a more recent LDC, it 
may well have been corrected. The issue thread is here: 
https://github.com/ldc-developers/ldc/issues/2390.


Ah, that might be the reason. I am not using LTO. You might 
want to try a newer version of LDC as well since 1.5 is quite 
old now.


I switched to LDC 1.12.0. The problem remains with LTO and static 
builds on Ubuntu 14.04. Ubuntu 16.04 is required, at least with 
LTO of druntime/phobos. The good news on this front is that the 
regularly updated dlang2 docker images work fine with LTO on 
druntime/phobos (using the LTO build support available in LDC 
1.9.0). Examples of travis-ci setups for both dlanguage and 
dlang2 docker images are available on the tsv-utils travis 
config: 
https://github.com/eBay/tsv-utils/blob/master/.travis.yml. Look 
for the DOCKERSPECIAL environment variables.


Re: d word counting approach performs well but has higher mem usage

2018-11-04 Thread Jon Degenhardt via Digitalmars-d-learn

On Saturday, 3 November 2018 at 14:26:02 UTC, dwdv wrote:

Hi there,

the task is simple: count word occurrences from stdin (around 
150mb in this case) and print sorted results to stdout in a 
somewhat idiomatic fashion.


Now, d is quite elegant while maintaining high performance 
compared to both c and c++, but I, as a complete beginner, 
can't identify where the 10x memory usage (~300mb, see results 
below) is coming from.


Unicode overhead? Internal buffer? Is something slurping the 
whole file? Assoc array allocations? Couldn't find huge allocs 
with dmd -vgc and -profile=gc either. What did I do wrong?


Not exactly the same problem, but there is relevant discussion in 
the blog post I wrote a while ago:  
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/


See in particular the section on Associative Array lookup 
optimization. This takes advantage of the fact that it's only 
necessary to create the immutable string the first time a key is 
entered into the hash. Subsequent occurrences do not need to take 
this step. As creating the string allocates new memory, even if 
only used temporarily, avoiding it is a meaningful savings.


There have been additional APIs added to the AA interface since I 
wrote the blog post; I believe it is now possible to accomplish 
the same thing with more succinct code.
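A minimal sketch of the pattern (hedged; simplified from the blog post, with an in-memory buffer standing in for byLine input):

```d
import std.algorithm : splitter;

void main()
{
    size_t[string] counts;
    char[] text = "the quick the lazy the".dup;  // stands in for a byLine buffer

    foreach (word; text.splitter(' '))
    {
        if (auto p = word in counts)
            ++(*p);                 // fast path: lookup only, no allocation
        else
            counts[word.idup] = 1;  // first occurrence: copy the key once

    }

    assert(counts["the"] == 3);
    assert(counts["quick"] == 1);
}
```

The lookup via `in` accepts the mutable `char[]` slice directly; only the insert path pays for `idup`.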


Other optimization possibilities:
* Avoid auto-decode: Not sure if your code is hitting this, but 
if so it's a significant performance hit. Unfortunately, it's not 
always obvious when this is happening. The task you are 
performing doesn't need auto-decode because it is splitting on 
single-byte utf-8 char boundaries (newline and space).


* LTO on druntime/phobos: This is easy and will have a material 
speedup. Simply add

'-defaultlib=phobos2-ldc-lto,druntime-ldc-lto'
to the 'ldc2' build line, after the '-flto=full' entry. This will 
be a win because it will enable a number of optimizations in the 
internal loop.


* Reading the whole file vs line by line - 'byLine' is really 
fast. It's also nice and general, as it allows reading arbitrary 
size files or standard input without changes to the code. 
However, it's not as fast as reading the file in a single shot.


* std.algorithm.joiner - Has improved dramatically, but is still 
slower than a foreach loop. See: 
https://github.com/dlang/phobos/pull/6492
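To illustrate the auto-decode point above: splitting on a single-byte delimiter can operate on raw code units, skipping dchar decoding entirely (a hedged sketch):

```d
import std.algorithm : splitter;
import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string line = "count these words";
    // Split over raw code units; the single-byte ' ' delimiter makes
    // dchar decoding unnecessary, avoiding the auto-decode penalty.
    assert(line.byCodeUnit.splitter(' ').walkLength == 3);
}
```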


--Jon




Re: Which Docker to use?

2018-10-21 Thread Jon Degenhardt via Digitalmars-d-learn

On Sunday, 21 October 2018 at 18:11:37 UTC, Jacob Carlborg wrote:

On 2018-10-18 01:15, Jon Degenhardt wrote:

I need to use docker to build static linked Linux executables. 
My reason
is specific, may be different than the OP's. I'm using 
Travis-CI to
build executables. Travis-CI uses Ubuntu 14.04, but static 
linking fails
on 14.04. The standard C library from Ubuntu 16.04 or later is 
needed.

There may be other/better ways to do this, I don't know.


That's interesting. I've built static binaries for DStep using 
LDC on Travis CI without any problems.


My comment painted too broad a brush. I had forgotten how 
specific the issue I saw was. Apologies for the confusion.


The issue that caused me to go to Ubuntu 16.04 had to do with 
uncaught exceptions when using LTO with the gold linker and LDC 
1.5. Problem occurred with 14.04, but not 16.04. I should go back 
and retest on Ubuntu 14.04 with a more recent LDC, it may well 
have been corrected. The issue thread is here: 
https://github.com/ldc-developers/ldc/issues/2390.


Re: Which Docker to use?

2018-10-20 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 19 October 2018 at 22:16:04 UTC, Ky-Anh Huynh wrote:
On Wednesday, 17 October 2018 at 23:15:53 UTC, Jon Degenhardt 
wrote:


I need to use docker to build static linked Linux executables. 
My reason is specific, may be different than the OP's. I'm 
using Travis-CI to build executables. Travis-CI uses Ubuntu 
14.04, but static linking fails on 14.04. The standard C 
library from Ubuntu 16.04 or later is needed. There may be 
other/better ways to do this, I don't know.


Yes I'm also using Travis-CI and that's why I need some Docker 
support.


I'm using dlanguage/ldc. The reason for that choice was because 
it was what was available when I put the travis build together. 
As you mentioned, it hasn't been updated in a while. I'm still 
producing this build with an older ldc version, but when I move 
to a more current version I'll have to switch to a different 
docker image.


My travis config is here: 
https://github.com/eBay/tsv-utils/blob/master/.travis.yml. Look 
for the sections referencing the DOCKERSPECIAL environment 
variable.


Re: Which Docker to use?

2018-10-17 Thread Jon Degenhardt via Digitalmars-d-learn
On Wednesday, 17 October 2018 at 08:08:44 UTC, Gary Willoughby 
wrote:
On Wednesday, 17 October 2018 at 03:37:21 UTC, Ky-Anh Huynh 
wrote:

Hi,

I need to build some static binaries with LDC. I also need to 
execute builds on both platform 32-bit and 64-bit.



From Docker Hub there are two image groups:

* language/ldc (last update 5 months ago)
* dlang2/ldc-ubuntu (updated recently)


Which one do you suggest?

Thanks a lot.


To be honest, you don't need docker for this. You can just 
download LDC in a self-contained folder and use it as is.


https://github.com/ldc-developers/ldc/releases

That's what I do on Linux.


I need to use docker to build static linked Linux executables. My 
reason is specific, may be different than the OP's. I'm using 
Travis-CI to build executables. Travis-CI uses Ubuntu 14.04, but 
static linking fails on 14.04. The standard C library from Ubuntu 
16.04 or later is needed. There may be other/better ways to do 
this, I don't know.


Re: A Friendly Challenge for D

2018-10-16 Thread Jon Degenhardt via Digitalmars-d

On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
D has multiple compilers, but for the speed of the finished 
binary, LDC2 is generally recommended.  I used version 1.11.0.  
https://github.com/ldc-developers/ldc/releases/tag/v1.11.0


I was using DUB to manage the project, but to build the 
stand-alone file from the gist link, use this command:  $ ldc2 
-release -O3 twinprimes_ssoz.d

And to run it:  $ echo "30" | ./twinprimes_ssoz


It'd be interesting to see if LTO or PGO generated an 
improvement. It looks like it could in this case, as it might 
optimize some of the inner loops. LTO is easy, enable it with:


-flto=<thin|full> -defaultlib=phobos2-ldc-lto,druntime-ldc-lto

(see: https://github.com/ldc-developers/ldc/releases/tag/v1.9.0). 
I've been using 'thin' on OSX, 'full' on Linux.


PGO is a bit more work, but not too bad. A good primer is here: 
https://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html


--Jon




Re: Iain Buclaw at GNU Tools Cauldron 2018

2018-10-07 Thread Jon Degenhardt via Digitalmars-d-announce

On Monday, 8 October 2018 at 05:12:03 UTC, Joakim wrote:

On Sunday, 7 October 2018 at 15:41:43 UTC, greentea wrote:

Date: September 7 to 9, 2018.
Location: Manchester, UK

GDC - D front-end GCC

https://www.youtube.com/watch?v=iXRJJ_lrSxE


Thanks for the link, just watched the whole video. The first 
half-hour sets the standard as an intro to the language, as 
only a compiler developer other than the main implementer could 
give, ie someone with fresh eyes.


I loved that Iain started off with a list of real-world 
projects. That's a mistake a lot of tech talks make, ie not 
motivating _why_ anybody should care about their tech and 
simply diving into the tech itself. I hadn't heard some of that 
info either, great way to begin.


I agree, a very nice talk, including the way the motivation part 
was handled. I especially liked the example of the group who 
typically used Python for rapid prototyping, then re-wrote in C++ 
for production, who upon trying D for a prototype, were 
pleasantly surprised it was performant enough for production.




Re: Error: variable 'xyz' has scoped destruction, cannot build closure

2018-10-05 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 5 October 2018 at 16:34:32 UTC, Paul Backus wrote:
On Friday, 5 October 2018 at 06:56:49 UTC, Nicholas Wilson 
wrote:
On Friday, 5 October 2018 at 06:44:08 UTC, Nicholas Wilson 
wrote:
Alas it does not, because each does not accept additional 
arguments other than the range. Shouldn't be hard to fix 
though.


https://issues.dlang.org/show_bug.cgi?id=19287


You can thread multiple arguments through to `each` using 
`std.range.zip`:


tenRandomNumbers
.zip(repeat(output))
.each!(unpack!((n, output) => 
output.appendln(n.to!string)));


Full code: https://run.dlang.io/is/Qe7uHt


Very interesting, thanks. It's a clever way to avoid the delegate 
capture issue.


(Aside: A nested function that accesses 'output' from lexical 
context has the same issue as delegates wrt capturing the 
variable.)


Re: Error: variable 'xyz' has scoped destruction, cannot build closure

2018-10-05 Thread Jon Degenhardt via Digitalmars-d-learn

On Friday, 5 October 2018 at 06:44:08 UTC, Nicholas Wilson wrote:
On Friday, 5 October 2018 at 06:22:57 UTC, Nicholas Wilson 
wrote:
tenRandomNumbers.each!((n,o) => 
o.appendln(n.to!string))(output);


or

tenRandomNumbers.each!((n, ref o) => 
o.appendln(n.to!string))(output);


should hopefully do the trick (run.dlang.io seems to be down 
atm).




Alas it does not, because each does not accept additional 
arguments other than the range. Shouldn't be hard to fix though.


Yeah, that's what I was seeing also. Thanks for taking a look. Is 
there perhaps a way to limit the scope of the delegate to the 
local function? Something that would tell the compiler the 
delegate has a lifetime shorter than the struct.


One specific thing this points out is that this is a place where the 
BufferedOutputRange I wrote cannot be used interchangeably with 
other output ranges. It's minor, but the intent was to be able to 
pass this anyplace an output range could be used.


Error: variable 'xyz' has scoped destruction, cannot build closure

2018-10-04 Thread Jon Degenhardt via Digitalmars-d-learn
I got the compilation error in the subject line when trying to 
create a range via std.range.generate. Turns out this was caused 
by trying to create a closure for 'generate' where the closure 
was accessing a struct containing a destructor.


The fix was easy enough: write out the loop by hand rather than 
using 'generate' with a closure. What I'm wondering/asking is if 
there alternate way to do this that would enable the 'generate' 
approach. This is more curiosity/learning at this point.


Below is a stripped down version of what I was doing. I have a 
struct for output buffering. The destructor writes any data left 
in the buffer to the output stream. This gets passed to routines 
performing output. It was this context that I created a generator 
that wrote to it.


example.d-
struct BufferedStdout
{
import std.array : appender;

private auto _outputBuffer = appender!(char[]);

~this()
{
import std.stdio : write;
write(_outputBuffer.data);
_outputBuffer.clear;
}

void appendln(T)(T stuff)
{
import std.range : put;
put(_outputBuffer, stuff);
put(_outputBuffer, "\n");
}
}

void foo(BufferedStdout output)
{
import std.algorithm : each;
import std.conv : to;
import std.range: generate, takeExactly;
import std.random: Random, uniform, unpredictableSeed;

auto randomGenerator = Random(unpredictableSeed);
auto randomNumbers = generate!(() => uniform(0, 1000, 
randomGenerator));

auto tenRandomNumbers = randomNumbers.takeExactly(10);
tenRandomNumbers.each!(n => output.appendln(n.to!string));
}

void main(string[] args)
{
foo(BufferedStdout());
}
End of  example.d-

Compiling the above results in:

   $ dmd example.d
   example.d(22): Error: variable `example.foo.output` has scoped 
destruction, cannot build closure


As mentioned, using a loop rather than 'generate' works fine, but 
suggestions for alternatives that would still use 'generate' 
would be appreciated.
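One workaround I'm aware of (a hedged sketch; 'Sink' is a hypothetical minimal stand-in for BufferedStdout) is to pass a pointer to the struct. The closure then captures the pointer, which has no scoped destruction, at the cost of the caller guaranteeing the struct outlives the range:

```d
import std.algorithm : each;
import std.array : appender;
import std.conv : to;
import std.random : Random, uniform, unpredictableSeed;
import std.range : generate, put, takeExactly;
import std.stdio : write;

// Hypothetical minimal stand-in for the BufferedStdout struct above.
struct Sink
{
    private auto _buf = appender!(char[]);
    ~this() { write(_buf.data); _buf.clear; }
    void appendln(T)(T s) { put(_buf, s); put(_buf, "\n"); }
}

void fooViaPointer(Sink* output)  // pointer parameter: the closure captures
{                                 // a pointer, not a value with a destructor
    auto rng = Random(unpredictableSeed);
    generate!(() => uniform(0, 1000, rng))
        .takeExactly(10)
        .each!(n => output.appendln(n.to!string));
}

void main()
{
    auto sink = Sink();
    fooViaPointer(&sink);   // caller guarantees 'sink' outlives the call
}
```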


The actual buffered output struct has more behind it than shown 
above, but not too much. For anyone interested it's here:  
https://github.com/eBay/tsv-utils/blob/master/common/src/tsvutil.d#L358


Re: More fun with autodecoding

2018-09-09 Thread Jon Degenhardt via Digitalmars-d
On Saturday, 8 September 2018 at 15:36:25 UTC, Steven 
Schveighoffer wrote:

On 8/9/18 2:44 AM, Walter Bright wrote:

On 8/8/2018 2:01 PM, Steven Schveighoffer wrote:
Here's where I'm struggling -- because a string provides 
indexing, slicing, length, etc. but Phobos ignores that. I 
can't make a new type that does the same thing. Not only 
that, but I'm finding the specializations of algorithms only 
work on the type "string", and nothing else.


One of the worst things about autodecoding is it is special, 
it *only* steps in for strings. Fortunately, however, that 
specialness enabled us to save things with byCodePoint and 
byCodeUnit.


So it turns out that technically the problem here, even though 
it seemed like an autodecoding problem, is a problem with 
splitter.


splitter doesn't deal with encodings of character ranges at all.


This could partially explain why, when I tried byCodeUnit and 
friends a while ago, I concluded it wasn't a reasonable approach: 
splitter is in the middle of much of what I've written.


Even if splitter is changed I'll still be very doubtful about the 
byCodeUnit approach as a work-around. An automated way to 
validate that it is engaged only when necessary would be very 
helpful (@noautodecode perhaps? :))


--Jon



Re: This is why I don't use D.

2018-09-05 Thread Jon Degenhardt via Digitalmars-d
On Wednesday, 5 September 2018 at 16:26:14 UTC, rikki cattermole 
wrote:

On 06/09/2018 4:19 AM, H. S. Teoh wrote:
On Wed, Sep 05, 2018 at 09:34:14AM -0600, Jonathan M Davis via 
Digitalmars-d wrote:
On Wednesday, September 5, 2018 9:28:38 AM MDT H. S. Teoh via 
Digitalmars-d

wrote:

[...]
Also, if the last working compiler version is prominently 
displayed e.g.
in the search results, it will inform people about the 
maintenance state
of that package, which could factor in their decision to use 
that
package or find an alternative.  It will also inform people 
about

potential breakages before they upgrade their compiler.

It doesn't solve all the problems, but it does seem like a 
good initial

low-hanging fruit that shouldn't be too hard to implement.


Alternatively we can let dub call home for usage with CI 
systems and register it having been tested for a given compiler 
on a specific tag.


A possibility might be to let package owners specify one of the 
build status badges commonly added to README files when 
registering the DUB package. Then display the badge in the 
code.dlang.org pages (home page, search result page). It would of 
course be better to display the latest compiler version tested, 
but repurposing existing badges might be simpler and provide some 
value until a more sophisticated scheme can be implemented.


--Jon


Re: tupleof function parameters?

2018-08-28 Thread Jon Degenhardt via Digitalmars-d-learn
On Tuesday, 28 August 2018 at 06:20:37 UTC, Sebastiaan Koppe 
wrote:
On Tuesday, 28 August 2018 at 06:11:35 UTC, Jon Degenhardt 
wrote:
The goal is to write the argument list once and use it to 
create both the function and the Tuple alias. That way I could 
create a large number of these function / arglist tuple pairs 
with less brittleness.


--Jon


I would probably use a combination of std.traits.Parameters and 
std.traits.ParameterIdentifierTuple.


Parameters returns a tuple of types and 
ParameterIdentifierTuple returns a tuple of strings. Maybe 
you'll need to implement a staticZip to interleave both tuples 
to get the result you want. (although I remember seeing one 
somewhere).


Alex, Sebastiaan - Thanks much, this looks like it should get me 
what I'm looking for. --Jon
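For reference, a sketch along the lines Sebastiaan suggests (hedged; the 'ParamTuple' and 'Zip' names are my own, a hand-rolled interleave in place of a staticZip):

```d
import std.meta : AliasSeq;
import std.traits : Parameters, ParameterIdentifierTuple;
import std.typecons : Tuple;

bool fn(string op, int v1, int v2)
{
    return op == "<" ? v1 < v2 : v1 > v2;
}

// Interleave parameter types and names into the form
// Tuple!(string, "op", int, "v1", int, "v2").
template ParamTuple(alias F)
{
    template Zip(size_t i)
    {
        static if (i == Parameters!F.length)
            alias Zip = AliasSeq!();
        else
            alias Zip = AliasSeq!(Parameters!F[i],
                                  ParameterIdentifierTuple!F[i],
                                  Zip!(i + 1));
    }
    alias ParamTuple = Tuple!(Zip!0);
}

void main()
{
    alias fnArgs = ParamTuple!fn;
    auto args = fnArgs("<", 3, 5);
    assert(fn(args[]));       // expand the tuple into fn's argument list
    assert(args.op == "<");   // named field access works too
}
```

With this, the argument list is written once (in the function signature) and the tuple alias is derived from it.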


tupleof function parameters?

2018-08-28 Thread Jon Degenhardt via Digitalmars-d-learn
I'd like to create a Tuple alias representing a function's 
parameter list. Is there a way to do this?


Here's an example creating a Tuple alias for a function's 
parameters by hand:


import std.typecons: Tuple;

bool fn(string op, int v1, int v2)
{
switch (op)
{
default: return false;
case "<": return v1 < v2;
case ">": return v1 > v2;
}
}

alias fnArgs = Tuple!(string, "op", int, "v1", int, "v2");

unittest
{
auto args = fnArgs("<", 3, 5);
assert(fn(args[]));
}

This is quite useful. I'm wondering if there is a way to create 
the 'fnArgs' alias from the definition of 'fn' without needing to 
manually write out the '(string, "op", int, "v1", int, "v2")' 
sequence by hand. Something like a 'tupleof' operation on the 
function parameter list. Or conversely, define the tuple and use 
it when defining the function.


The goal is to write the argument list once and use it to create 
both the function and the Tuple alias. That way I could create a 
large number of these function / arglist tuple pairs with less 
brittleness.


--Jon



Re: Dicebot on leaving D: It is anarchy driven development in all its glory.

2018-08-26 Thread Jon Degenhardt via Digitalmars-d

On Sunday, 26 August 2018 at 05:55:47 UTC, Pjotr Prins wrote:

Artem wrote Sambamba as a student

https://github.com/biod/sambamba

and it is now running around the world in sequencing centers. 
Many many CPU hours and a resulting huge carbon foot print. The 
large competing C++ samtools project has been trying for 8 
years to catch up with an almost unchanged student project and 
they are still slower in many cases.


[snip]

Note that Artem used the GC and only took GC out for critical 
sections in parallel code. I don't buy these complaints about 
GC.


The complaints about breaking code I don't see that much 
either. Sambamba pretty much kept compiling over the years and 
with LDC/LLVM latest we see a 20% perfomance increase. For free 
(at least from our perspective). Kudos to LDC/LLVM efforts!!


This sounds very similar to my experiences with the tsv 
utilities, on most of the same points (development simplicity, 
comparative performance, GC use, LDC). Data processing apps may 
well be a sweet spot. See my DConf talk for an overview 
(https://github.com/eBay/tsv-utils/blob/master/docs/dconf2018.pdf).


Though not mentioned in the talk, I also haven't had any 
significant issues with new compiler releases. That may be 
related to the type of code being written. Regarding the GC: the 
throughput oriented nature of data processing tools like the tsv 
utilities looks like a very good fit for the current GC. 
Applications where low GC latency is needed may have different 
results. It'd be great to hear an experience report from 
development of an application where GC was used and low GC 
latency was a priority.


--Jon


Re: D is dead (was: Dicebot on leaving D: It is anarchy driven development in all its glory.)

2018-08-23 Thread Jon Degenhardt via Digitalmars-d

On Friday, 24 August 2018 at 00:46:14 UTC, Mike Franklin wrote:
It seems, from someone without much historical perspective, 
that Phobos was intended to be something like the .Net 
Framework for D.  Perhaps there are a few fundamentals 
(std.algorithm, std.allocator, etc.) to keep, but for the 
others... move 'em to Dub and let the "free market" sort it out.


That might work for some use cases, but not for others. For my 
use cases, a rock solid standard library is a basic requirement 
(think STL, Boost, etc). These don't normally come out of a loose 
knit community of individuals, there needs to be some sort of 
organizational presence involved to ensure quality, consistency, 
completeness, etc. If Phobos or an equivalent wasn't available at 
its present level of quality then D wouldn't be in the 
consideration set.


On the other hand, my use-cases don't have the requirements that 
drive other folks towards removing dependence on druntime and 
similar. An individual or organization's prioritization 
preferences will depend on their goals.


--Jon


Re: More fun with autodecoding

2018-08-09 Thread Jon Degenhardt via Digitalmars-d
On Wednesday, 8 August 2018 at 21:01:18 UTC, Steven Schveighoffer 
wrote:
Not trying to give too much away about the library I'm writing, 
but the problem I'm trying to solve is parsing out tokens from 
a buffer. I want to delineate the whole, as well as the parts, 
but it's difficult to get back to the original buffer once you 
split and slice up the buffer using phobos functions.


I wonder if there are some parallels in the tsv utilities I 
wrote. The tsv parser is extremely simple, byLine and splitter on 
a char buffer. Most of the tools just iterate the split result in 
order, but a couple do things like operate on a subset of fields, 
potentially reordered. For these a separate structure is created 
that maps back to the original buffer to avoid copying. Likely 
quite simple compared to what you are doing.


The csv2tsv tool may be more interesting. Parsing is relatively 
simple, mostly identifying field values in the context of CSV 
escape syntax. It's modeled as reading an infinite stream of 
utf-8 characters, byte-by-byte. Occasionally the bytes forming 
the value need to be modified due to the escape syntax, but most 
of the time the characters in the original buffer remain 
untouched and parsing is identifying the start and end positions.


The infinite stream is constructed by reading fixed size blocks 
from the input stream and concatenating them with joiner. This 
eliminates the need to worry about utf-8 characters spanning 
block boundaries, but it comes at a cost: either write 
byte-at-a-time, or make an extra copy (also byte-at-a-time). 
Making an extra copy is faster, so that's what the code does. 
But, as a practical matter, large blocks could often be written 
directly from the original input buffer.
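A hedged sketch of the block-joining idea, with an in-memory array standing in for File.byChunk blocks:

```d
import std.algorithm : equal, joiner;

void main()
{
    // Stands in for fixed-size blocks read from the input stream;
    // in the real tool these would come from File.byChunk.
    auto blocks = ["id,na", "me\n1,", "alice\n"];

    // joiner presents the blocks as one contiguous character stream,
    // so the parser never sees a utf-8 character spanning a block
    // boundary; the cost is character-at-a-time iteration.
    auto stream = blocks.joiner;
    assert(stream.equal("id,name\n1,alice\n"));
}
```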


If I wanted to make it faster than it currently is, that's what 
I'd do. But I don't see an easy way to do this with Phobos 
ranges. At minimum 
I'd have to be able to run code when the joiner operation hits 
block boundaries. And it'd also be necessary to create a mapping 
back to the original input buffer.


Autodecoding comes into play of course. Basically, splitter on 
char arrays is fine, but in a number of cases it's necessary to 
work with ubyte to avoid the performance penalty.


--Jon


Re: std.experimental.collections.rcstring and its integration in Phobos

2018-07-17 Thread Jon Degenhardt via Digitalmars-d

On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
So we managed to revive the rcstring project and it's already a 
PR for Phobos:


https://github.com/dlang/phobos/pull/6631 (still WIP though)

The current approach in short:

- uses the new @nogc, @safe and nothrow Array from the 
collections library (check Eduardo's DConf18 talk)

- uses reference counting
- _no_ range by default (it needs an explicit `.by!{d,w,}char`) 
(as in no auto-decoding by default)


[snip]

What do you think about this approach? Do you have a better 
idea?


I don't know the goals/role rcstring is expected to play, 
especially wrt existing string/character facilities. Perhaps you 
could describe more?


Strings are central to many applications, so I'm wondering about 
things like whether rcstring is intended as a replacement for 
string that would be used by most new programs, and whether 
applications would use arrays and ranges of char together with 
rcstring, or rcstring would be used for everything.


Perhaps it's too early for these questions, and the current goal 
is simpler. For example, adding a meaningful collection class 
that is @nogc, @safe and ref-counted that can be used as a proving 
ground for the newer memory management facilities being developed.


Such simpler goals would be quite reasonable. What's got me 
wondering about the larger questions are the comments about 
ranges and autodecoding. If rcstring is intended as a vehicle for 
general @nogc handling of character data and/or for reducing the 
impact of autodecoding, then it makes sense to consider from 
those perspectives.


--Jon


eBay's TSV Utilities repository renamed

2018-07-15 Thread Jon Degenhardt via Digitalmars-d-announce
I've renamed the TSV Utilities Github repository from 
eBay/tsv-utils-dlang to eBay/tsv-utils. This is to better reflect 
the functional nature of the tools.


Links pointing to the old github repo will be redirected to the 
new repo. This includes git operations like clone, etc., so 
Project Tester should not be affected. Let me know if any issues 
surface.


--Jon


Re: Driving Continuous Improvement in D

2018-06-02 Thread Jon Degenhardt via Digitalmars-d-announce

On Saturday, 2 June 2018 at 07:23:42 UTC, Mike Parker wrote:
In this post for the D Blog, Jack Stouffer details how dscanner 
is used in the Phobos development process to help improve code 
quality and fight entropy.


The blog:
https://dlang.org/blog/2018/06/02/driving-continuous-improvement-in-d/

reddit:
https://www.reddit.com/r/programming/comments/8nyzmk/driving_continuous_improvement_in_d/


Nice post. I haven't tried dscanner on my code, but I plan to 
now. It looks like the documentation on the dscanner repo is 
pretty good. If you think it's ready for wider adoption, consider 
adding a couple lines to the blog post indicating that folks who 
want to try it will find instructions in the repo.


Re: Splitting up large dirty file

2018-05-21 Thread Jon Degenhardt via Digitalmars-d-learn

On Monday, 21 May 2018 at 15:00:09 UTC, Dennis wrote:
I want to be convinced that Range programming works like a 
charm, but the procedural approaches remain more flexible (and 
faster too) it seems. Thanks for the example.



On Monday, 21 May 2018 at 22:11:42 UTC, Dennis wrote:
In this case I used drop to drop lines, not characters. The 
exception was thrown by the joiner it turns out.

 ...
From the benchmarking I did, I found that ranges are easily an 
order of magnitude slower even with compiler optimizations:


My general experience is that range programming works quite well. 
It's especially useful when used to do lazy processing and as a 
result minimize memory allocations. I've gotten quite good 
performance with these techniques (see my DConf talk slides: 
https://dconf.org/2018/talks/degenhardt.html).


Your benchmarks are not against the file split case, but if you 
benchmarked that case you might find it slow as well. If so, 
you may be hitting specific areas where there are opportunities 
for performance improvement in the standard library. One is that 
joiner is slow (PR: https://github.com/dlang/phobos/pull/6492). 
Another is that the write[fln] routines are much faster when 
operating on a single large object than on many small objects. 
For example, it's faster to call write[fln] with an array of 100 
characters than to: (a) call it 100 times with one character; 
(b) call it once, with 100 characters as individual arguments 
(template form); (c) call it once with a range of 100 characters, 
each processed one at a time.


When joiner is used as in your example, you hit not only the 
joiner performance issue but also the write[fln] issue. This is 
due to something that may not be obvious at first: when joiner is 
used to concatenate arrays or ranges, it flattens out the 
array/range into a single range of elements. So, rather than 
writing a line at a time, your example is effectively passing a 
character at a time to write[fln].
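As a minimal illustration of the flattening behavior (not the 
performance issue itself), joiner turns a range of strings into a 
single element-at-a-time range:

```d
import std.algorithm : equal, joiner;

void main()
{
    auto lines = ["abc", "def"];
    // joiner concatenates the strings into one range of characters;
    // a consumer such as write[fln] then sees one element at a time
    assert(lines.joiner("\n").equal("abc\ndef"));
}
```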


So, in the file split case, using byLine in an imperative fashion 
as in my example will have the effect of passing a full line at a 
time to write[fln], rather than individual characters. Mine will 
be faster, but not because it's imperative; the same effect could 
be achieved with range composition.
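A minimal sketch of that imperative approach (the names 
splitFile and the split_N.txt output naming are illustrative, not 
from the original example):

```d
import std.algorithm : count;
import std.conv : to;
import std.stdio;

// Split input into files of linesPerFile lines each. Each complete line
// is passed to write at once, and byLine does not decode characters, so
// invalid utf-8 passes through untouched.
void splitFile(File input, size_t linesPerFile)
{
    size_t lineCount = 0;
    size_t fileCount = 0;
    File output;
    foreach (line; input.byLine(KeepTerminator.yes))
    {
        if (lineCount % linesPerFile == 0)
            output = File("split_" ~ (++fileCount).to!string ~ ".txt", "w");
        output.write(line);
        ++lineCount;
    }
}

void main()
{
    // demo: five lines split two per file -> split_1.txt .. split_3.txt
    auto tmp = File("input.txt", "w");
    foreach (i; 0 .. 5)
        tmp.writeln(i);
    tmp.close();
    splitFile(File("input.txt"), 2);
    assert(File("split_1.txt").byLine.count == 2);
}
```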


Regarding the benchmark programs you showed - This is very 
interesting. It would certainly be worth additional looks into 
this. One thing I wonder is if the performance penalty may be due 
to a lack of inlining due to crossing library boundaries. The 
imperative versions aren't crossing these boundaries. If you're 
willing, you could try adding LDC's LTO options and see what 
happens. There are some instructions in the release notes for LDC 
1.9.0 (https://github.com/ldc-developers/ldc/releases). Make sure 
you use the form that includes druntime and phobos.


--Jon


Re: Splitting up large dirty file

2018-05-17 Thread Jon Degenhardt via Digitalmars-d-learn

On Thursday, 17 May 2018 at 20:08:09 UTC, Dennis wrote:

On Wednesday, 16 May 2018 at 15:47:29 UTC, Jon Degenhardt wrote:
If you write it in the style of my earlier example and use 
counters and if-tests it will work. byLine by itself won't try 
to interpret the characters (won't auto-decode them), so it 
won't trigger an exception if there are invalid utf-8 
characters.


When printing to stdout it seems to skip any validation, but 
writing to a file does give an exception:


```
auto inputStream = (args.length < 2 || args[1] == "-") ? 
stdin : args[1].File;

auto outputFile = new File("output.txt");
foreach (line; inputStream.byLine(KeepTerminator.yes)) 
outputFile.write(line);

```
std.exception.ErrnoException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\stdio.d(2877):
  (No error)

According to the documentation, byLine can throw an 
UTFException so relying on the fact that it doesn't in some 
cases doesn't seem like a good idea.


Instead of:

 auto outputFile = new File("output.txt");

try:

auto outputFile = File("output.txt", "w");

That works for me. The second arg ("w") opens the file for write. 
When I omit it, I also get an exception, as the default open mode 
is for read:


 * If file does not exist:  Cannot open file `output.txt' in mode 
`rb' (No such file or directory)

 * If file does exist:   (Bad file descriptor)

The second error presumably occurs when writing.
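A small check of the open-mode behavior (the file name is 
arbitrary):

```d
import std.stdio;

void main()
{
    // File(name) defaults to read mode ("rb"); pass "w" to open for writing
    auto f = File("output.txt", "w");
    f.writeln("hello");
    f.close();
    assert(File("output.txt").readln() == "hello\n");
}
```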

As an aside - I agree with one of your bigger picture 
observations: It would be preferable to have more control over 
utf-8 error handling behavior at the application level.


Re: Splitting up large dirty file

2018-05-16 Thread Jon Degenhardt via Digitalmars-d-learn

On Wednesday, 16 May 2018 at 07:06:45 UTC, Dennis wrote:

On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
Can you show the program you are using that throws when using 
byLine?


Here's a version that only outputs the first chunk:
```
import std.stdio;
import std.range;
import std.algorithm;
import std.file;
import std.exception;

void main(string[] args) {
enforce(args.length == 2, "Pass one filename as argument");
	auto lineChunks = File(args[1], 
"r").byLine.drop(4).chunks(10_000_000/10);

new File("output.txt", "w").write(lineChunks.front.joiner);
}
```


If you write it in the style of my earlier example and use 
counters and if-tests it will work. byLine by itself won't try to 
interpret the characters (won't auto-decode them), so it won't 
trigger an exception if there are invalid utf-8 characters.




Re: Splitting up large dirty file

2018-05-15 Thread Jon Degenhardt via Digitalmars-d-learn

On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:

I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb 
would fit but I get an out of memory error when using 
std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes 
in the middle of lines)


I want to write a program that splits it up into multiple 
files, with the splits happening every n lines. I keep 
encountering roadblocks though:


- You can't give Yes.useReplacementChar to `byLine` and 
`byLine` (or `readln`) throws an Exception upon encountering an 
invalid character.


Can you show the program you are using that throws when using 
byLine? I tried a very simple program that reads and outputs 
line-by-line, then fed it a file that contained invalid utf-8, 
and saw no exceptions thrown. The invalid utf-8 was created by 
taking part of this file: 
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a 
commonly used file with utf-8 edge cases) and adding a number 
of random hex characters, including null.


The program I used:

```
int main(string[] args)
{
    import std.stdio;
    try
    {
        auto inputStream = (args.length < 2 || args[1] == "-")
            ? stdin : args[1].File;
        foreach (line; inputStream.byLine(KeepTerminator.yes))
            write(line);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
```





Re: iopipe v0.0.4 - RingBuffers!

2018-05-11 Thread Jon Degenhardt via Digitalmars-d-announce
On Friday, 11 May 2018 at 15:44:04 UTC, Steven Schveighoffer 
wrote:

On 5/10/18 7:22 PM, Steven Schveighoffer wrote:

Shameful note: Macos grep is BSD grep, and is not NEARLY as 
fast as GNU grep, which has much better performance (and is 2x 
as fast as iopipe_search on my Linux VM, even when printing 
line numbers).


Yeah, the MacOS default versions of the Unix text processing 
tools are really slow. It's worth installing the GNU versions if 
you're doing performance comparisons on MacOS, or if you work 
with large files. Homebrew and MacPorts both have the GNU 
versions. Some relevant packages: coreutils, grep, gsed (sed), 
gawk (awk).


Most tools are in coreutils. Many will be installed with a 'g' 
prefix by default, leaving the existing tools in place. e.g. 
'cut' will be installed as 'gcut' unless specified otherwise.


--Jon



Re: Things to do in Munich

2018-05-01 Thread Jon Degenhardt via Digitalmars-d-announce

On Monday, 30 April 2018 at 19:57:10 UTC, Seb wrote:
As I live in Munich and there have been a few threads about 
things to do in Munich, I thought I quickly share a few 
selected activities + current events.


- over 80 museums (best ones: Museum Brandhost, Pinakothek der 
Moderne, Haus der Kunst, Deutsches Museum, Glyptothek, potato 
museum, NS-


Most of the museums are closed today (public holiday). Check 
before you go. However, the surfers are out!


--Jon


Re: Am I reading this wrong, or is std.getopt *really* this stupid?

2018-03-24 Thread Jon Degenhardt via Digitalmars-d
On Saturday, 24 March 2018 at 16:11:18 UTC, Andrei Alexandrescu 
wrote:
Anyhow. Right now the order of processing is the same as the 
lexical order in which flags are passed to getopt. There may be 
use cases for which that's the more desirable way to go about 
things, so if you author a PR to change the order you'd need to 
build an argument on why command-line order is better. FWIW the 
traditional POSIX doctrine makes behavior of flags independent 
of their order, which would imply the current choice is more 
natural.


Several of the TSV tools I built rely on command-line order. 
There is an enhancement request here: 
https://issues.dlang.org/show_bug.cgi?id=16539.


A few of the tools use a paradigm where the user is entering a 
series of instructions on the command line, and there are times 
when the user-entered order matters. Two general cases:


* Display/output order - The tool produces delimited output, and 
the user wants to control the order. The order of command line 
options determines the order.


* Short-circuiting - tsv-filter in particular allows numeric 
tests like less-than, but also allows the user to short-circuit 
a test by first checking that the data contains a valid number 
prior to making the numeric test. This is done by evaluating the 
command line arguments in left-to-right order.


Short-circuiting is supported by the Unix `find` utility.

I have used this approach for CLI tools I've written in other 
languages. Perl's Getopt::Long processes args in command-line 
order, so it supports this.
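A stripped-down sketch of the idea (hypothetical, not the actual 
tsv-filter implementation): tests accumulate in command-line 
order and run left to right, so an is-numeric guard placed before 
a numeric comparison protects it:

```d
import std.conv : to;
import std.string : isNumeric;

void main()
{
    alias Test = bool function(string);
    Test[] tests;
    tests ~= (string s) => s.isNumeric;        // e.g. --is-numeric 1
    tests ~= (string s) => s.to!double < 100;  // e.g. --lt 1:100

    // run the tests in the order given; stop at the first failure
    bool pass(string field)
    {
        foreach (t; tests)
            if (!t(field))
                return false;
        return true;
    }

    assert(pass("42"));
    assert(!pass("abc"));  // guard fails first; to!double is never called
    assert(!pass("150"));
}
```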


I considered submitting a PR to getopt to change this, but 
decided against it. The approach used looks like it is central to 
the design, and changing it in a backward compatible way would be 
a meaningful undertaking. Instead I wrote a cover to getopt that 
processes arguments in command-line order. It is here: 
https://github.com/eBay/tsv-utils-dlang/blob/master/common/src/getopt_inorder.d. It handles most of what std.getopt handles.


The TSV utilities documentation should help illustrate these 
cases. tsv-filter uses short-circuiting: 
https://github.com/eBay/tsv-utils-dlang/blob/master/docs/ToolReference.md#tsv-filter-reference. Look for "Short-circuiting expressions" toward the bottom of the section.


tsv-summarize obeys the command-line order for output/display. 
See: 
https://github.com/eBay/tsv-utils-dlang/blob/master/docs/ToolReference.md#tsv-summarize-reference.


There's one other general limitation I encountered with the 
current compile-time approach to command-line argument 
processing. I couldn't find a clean way to allow it to be 
extended in a plug-in manner.


In particular, the original goal for the tsv-summarize tool was 
to allow users to create custom operators. The tool has a fair 
number of built-in operators, like median, sum, min, max, etc. 
Each of these operators has a getopt arg invoking it, e.g. 
'--median', '--sum', etc. However, it is common for people to 
have custom analysis needs, so allowing extension of the set 
would be quite useful.


The code is set up to allow this. People would clone the repo, 
write their own operator, place it in a separate file they 
maintain, and rebuild. However, I couldn't figure out a clean way 
to allow additions to the command line argument set. There may be 
a reasonable way and I just couldn't find it, but my current 
thinking is that I need to write my own command line argument 
handler to support this idea.


I think handling command line arguments at run-time would make 
this simpler, at the cost of losing some compile-time validation.


--Jon


Re: Why not flag away the mistakes of the past?

2018-03-07 Thread Jon Degenhardt via Digitalmars-d

On Wednesday, 7 March 2018 at 16:33:25 UTC, Seb wrote:
On Wednesday, 7 March 2018 at 15:26:40 UTC, Jon Degenhardt 
wrote:
On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist 
wrote:

[...]


Auto-decoding is a significant issue for the applications I 
work on (search engines). There is a lot of string 
manipulation in these environments, and performance matters. 
Auto-decoding is a meaningful performance hit. Otherwise, 
Phobos has a very nice collection of algorithms for string 
manipulation. It would be great to have a way to turn 
auto-decoding off in Phobos.


Well you can use byCodeUnit, which disables auto-decoding

Though it's not well-known and rather annoying to explicitly 
add it almost everywhere.


I looked at this once. It didn't appear to be a viable solution, 
though I forget the details. I can probably resurrect them if 
that would be helpful.
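For reference, the basic effect of byCodeUnit, whatever its 
limitations as a complete solution:

```d
import std.algorithm : count;
import std.utf : byCodeUnit;

void main()
{
    string s = "héllo";
    assert(s.count == 5);            // auto-decoding: 5 code points
    assert(s.byCodeUnit.count == 6); // no decoding: 6 UTF-8 code units
}
```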


Re: Why not flag away the mistakes of the past?

2018-03-07 Thread Jon Degenhardt via Digitalmars-d
On Wednesday, 7 March 2018 at 06:00:30 UTC, Taylor Hillegeist 
wrote:
So i've seen on the forum over the years arguments about 
auto-decoding (mostly) and some other things. Things that have 
been considered mistakes, and cannot be corrected because of 
the breaking changes it would create. And I always wonder why 
not make a solution to the tune of a flag that makes things 
work as they used too, and make the new behavior default.


dmd --UseAutoDecoding

That way the breaking change was easily fixable, and the 
mistakes of the past not forever. Is it just the cost of 
maintenance?


Auto-decoding is a significant issue for the applications I work 
on (search engines). There is a lot of string manipulation in 
these environments, and performance matters. Auto-decoding is a 
meaningful performance hit. Otherwise, Phobos has a very nice 
collection of algorithms for string manipulation. It would be 
great to have a way to turn auto-decoding off in Phobos.


--Jon


Re: Project Highlight: The D Community Hub

2018-02-18 Thread Jon Degenhardt via Digitalmars-d-announce

On Saturday, 17 February 2018 at 12:56:34 UTC, Mike Parker wrote:
In case you aren't aware of the dlang-community organization at 
GitHub, it's an umbrella group of contributors working to keep 
certain D projects alive and updated. Sebastian Wilzbach filled 
me in on some details for the latest Project Highlight on the 
blog.


blog:
https://dlang.org/blog/2018/02/17/project-highlight-the-d-community-hub/

reddit:
https://www.reddit.com/r/programming/comments/7y6gw1/the_d_community_hub_an_umbrella_group_for_d/


Very nice article. There are more projects there than I had 
realized!


Re: OT: Photo of a single atom by David Nadlinger wins top prize

2018-02-14 Thread Jon Degenhardt via Digitalmars-d

On Tuesday, 13 February 2018 at 23:09:07 UTC, Ali Çehreli wrote:

David (aka klickverbot) is a longtime D contributor.


https://www.epsrc.ac.uk/newsevents/news/single-trapped-atom-captures-science-photography-competitions-top-prize/

Ali


More than cool!! Congrats David!


Re: Which language futures make D overcompicated?

2018-02-10 Thread Jon Degenhardt via Digitalmars-d

On Friday, 9 February 2018 at 07:54:49 UTC, Suliman wrote:

Which language futures by your opinion make D harder?


For me, one of the attractive qualities of D is its relative 
simplicity. Key comparison points are C++, Scala, and Python. 
Python being the simplest, then D, not far off, with Scala and 
C++ being more complex. Entirely subjective, not measured in any 
empirical way.


That said, a couple of D constructs that I personally find 
increases friction:


* Static arrays aren't ranges. I continually forget to slice 
them when I want to use them as ranges. The compiler errors are 
often complex template instantiation failure messages.


* Template instantiation failures - It takes longer than I'd like 
to figure out why a template failed to instantiate. This is 
especially true when there are multiple overloads, each with 
multiple template constraints.


* Auto-decoding - Mentioned by multiple people. It's mainly an 
issue after you've decided you need to avoid it. Figuring out how 
to utilize Phobos routines without having them engage 
auto-decoding on your behalf is challenging.
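A small example of the first point: a static array must be sliced 
before range algorithms accept it.

```d
import std.algorithm : sum;

void main()
{
    int[3] fixed = [1, 2, 3];
    // auto s = fixed.sum;     // fails: int[3] is not an input range
    assert(fixed[].sum == 6);  // fixed[] slices it into a range
}
```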


--Jon


OT: Indexing reordering in the eBay Search Engine

2018-01-19 Thread Jon Degenhardt via Digitalmars-d
If anyone is interested in the type of work that goes on in my 
group at eBay, take a look at this blog post by one of my 
colleagues: 
https://www.ebayinc.com/stories/blogs/tech/making-e-commerce-search-faster/


It describes a 25% efficiency gain via a technique called index 
reordering. This is the engineering side of the work; I also work 
on recall and ranking.


--Jon


Re: I closed a very old bug!

2018-01-18 Thread Jon Degenhardt via Digitalmars-d
On Thursday, 18 January 2018 at 07:46:03 UTC, Andrei Alexandrescu 
wrote:
There's been some discussion about what to do with issues that 
propose enhancements like this. We want to make them available 
and searchable just in case someone working on a related 
proposal is looking for precedent and inspiration.


I was thinking of closing with REMIND or LATER. Seb is 
experimenting with moving the entire bug database to github 
issues, which may offer us more options for classification.


It would make sense to separate bugs from enhancements in this 
regard. It's useful to record and maintain useful enhancements 
ideas even if they don't fit the current priorities. There are 
multiple ways to implement this, but it'd be most useful if the 
distinction between "bugs" and "enhancements" is obvious and easy 
to discover.


--Jon


Re: TSV Utilities release with LTO and PGO enabled

2018-01-17 Thread Jon Degenhardt via Digitalmars-d-announce
On Wednesday, 17 January 2018 at 21:49:52 UTC, Johan Engelen 
wrote:
On Wednesday, 17 January 2018 at 04:37:04 UTC, Jon Degenhardt 
wrote:


Clearly personal judgment played a role. However, the tools 
are reasonably task focused, and I did take basic steps to 
ensure the benchmark data and tests were separate from the 
training data/tests. For these reasons, my confidence is good 
that the results are reasonable and well founded.


Great, thanks for the details, I agree.
Hope it's useful for others to see these details.


Thanks Johan, much appreciated. :)

(btw, did you also check the performance gains when using the 
profile of the benchmark itself, to learn about the upper-bound 
of PGO for your program?)


I'll merge the IR PGO addition into LDC master soon. Don't know 
what difference it'll make.


No, I didn't do an upper-bounds check, that's a good idea. I plan 
to test the IR based PGO when it's available, I'll run an 
upper-bounds check as part of it.




Re: TSV Utilities release with LTO and PGO enabled

2018-01-16 Thread Jon Degenhardt via Digitalmars-d-announce

On Tuesday, 16 January 2018 at 22:04:52 UTC, Johan Engelen wrote:
Because PGO optimizes for the given profile, it would help a 
lot if you clarified how you do your PGO benchmarking. What 
kind of test load profile you used for optimization and what 
test load you use for the time measurement.


The profiling setup is checked into the repo and run as part of a 
PGO build, so it is available for inspection. The benchmarks used 
for measuring deltas are also documented; they are the ones used 
in the March 2017 benchmark comparison against similar tools. 
This report is in the repo 
(https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md).


However, it's hard to imagine anyone perusing the repo for this 
stuff, so I'll try to summarize what I did below.


Benchmarks - Six different tests of rather different but common 
operations run on large data files. The six tests were chosen 
because for each I was able to find at least three other tools, 
written in native compiled languages, with similar functionality. 
There are other valuable benchmarks, but I haven't published them.


Profiling - Profiling was developed separately for each tool. For 
each I generated several data files with data representative of 
typical use cases. Generally numeric or text data in several 
forms and distributions. The data was unrelated to the data used 
in benchmarks, which is from publicly available machine learning 
data sets. However, personal judgement was used in the generation 
of the data sets, so it's not free from bias.


After generating the data, I generated a set of run options 
specific to each tool. As an example, tsv-filter selects data 
file lines based on various numeric and text criteria (e.g. 
less-than). There are a bit over 50 comparison operations, plus a 
few meta operations. The profiling runs ensure all the operations 
are run at least once, with the most important ones overweighted. 
The ldc.profile.resetAll call was used to exclude all the initial 
setup code (command line argument processing). This was nice 
because it meant the data files could be small relative to 
real-world sets, and profiling runs fast enough to do as part of 
the build step (i.e. on Travis-CI).


Look at 
https://github.com/eBay/tsv-utils-dlang/tree/master/tsv-filter/profile_data to see a concrete example (tsv-filter). In that directory are five data files and a shell script that runs the commands and collects the data.


This was done for four of the tools covering five of the 
benchmarks. I skipped one of the tools (tsv-join), as it's harder 
to come up with a concise set of profile operations for it.


I then ran the standard benchmarks I usually report on in various 
D venues.


Clearly personal judgment played a role. However, the tools are 
reasonably task focused, and I did take basic steps to ensure the 
benchmark data and tests were separate from the training 
data/tests. For these reasons, my confidence is good that the 
results are reasonable and well founded.


--Jon


Re: TSV Utilities release with LTO and PGO enabled

2018-01-15 Thread Jon Degenhardt via Digitalmars-d-announce

On Tuesday, 16 January 2018 at 00:19:24 UTC, Martin Nowak wrote:
On Sunday, 14 January 2018 at 23:18:42 UTC, Jon Degenhardt 
wrote:
Combined, LTO and PGO resulted in performance improvements 
greater than 25% on three of my standard six benchmarks, and 
five of the six improved at least 8%.


Yay, I'm usually seeing double digit improvements for PGO 
alone, and single digit improvements for LTO. Meaning PGO has 
more effect even though LTO seems to be the more hyped one.

Have you bothered benchmarking them separately?


Last spring I made a few quick tests of both separately. That was 
just against the app code, without druntime/phobos. I saw some 
benefit from LTO, mainly on one of the tools, and not much from 
PGO.


More recently I tried LTO standalone and LTO plus PGO, both 
against app code and druntime/phobos, but not PGO standalone. The 
LTO benchmarks are here: 
https://github.com/eBay/tsv-utils-dlang/blob/master/docs/dlang-meetup-14dec2017.pdf. I've haven't published the LTO + PGO benchmarks.


The takeaway from my tests is that LTO and PGO will benefit 
different apps differently, perhaps in ways not easily predicted. 
One of my tools benefited primarily from PGO, two primarily from 
LTO, and one materially from both. So, it is worth trying both.


For both, the big win was from optimizing across app code and 
libs (druntime/phobos in my case). It'd be interesting to see if 
other apps see similar behavior, either with phobos/druntime or 
other libraries, perhaps libraries from dub dependencies.


TSV Utilities release with LTO and PGO enabled

2018-01-14 Thread Jon Degenhardt via Digitalmars-d-announce
I just released a new version of eBay's TSV Utilities. The cool 
thing about the release is not about changes in toolkit, but that 
it was possible to build everything using LDC's support for Link 
Time Optimization (LTO) and Profile Guided Optimization (PGO). 
This includes running the optimizations on both the application 
code and the D standard libraries (druntime and phobos). Further, 
it was all doable on Travis-CI (Linux and MacOS), including 
building release binaries available from the GitHub release page.


Combined, LTO and PGO resulted in performance improvements 
greater than 25% on three of my standard six benchmarks, and five 
of the six improved at least 8%.


Release info: 
https://github.com/eBay/tsv-utils-dlang/releases/tag/v1.1.16




Re: DLang docker images for CircleCi 2.0

2018-01-05 Thread Jon Degenhardt via Digitalmars-d-announce

On Wednesday, 3 January 2018 at 13:12:48 UTC, Seb wrote:

tl;dr: you can now use special D docker images for CircleCi 2.0

[snip]

PS: I'm aware of Stefan Rohe's great D Docker images [1], but 
this Docker image is built on top of the specialized CircleCi 
image (e.g. for their SSH login).


One useful characteristic of Stefan's images is that the 
Dockerhub pages include the Dockerfile and github repository 
links. I don't know what it takes to include them. It does make 
it easier to see exactly what the configuration is, find the 
repo, and even create PRs against them. Would be useful if they 
can be added to the CircleCI image pages.


My interest in this case - I use Stefan's LDC image in Travis-CI 
builds. Building the runtime libraries with LTO/PGO requires the 
ldc-build-runtime tool, which in turn requires a few additional 
things in the docker image, like cmake or ninja. I was interested 
if they might have been included in the CircleCI images as well. 
(Doesn't appear so.)


Re: Article: Finding memory bugs in D code with AddressSanitizer

2017-12-26 Thread Jon Degenhardt via Digitalmars-d-announce

On Monday, 25 December 2017 at 17:03:37 UTC, Johan Engelen wrote:
I've been writing this article since August, and finally found 
some time to finish it:


http://johanengelen.github.io/ldc/2017/12/25/LDC-and-AddressSanitizer.html

"LDC comes with improved support for Address Sanitizer since 
the 1.4.0 release. Address Sanitizer (ASan) is a runtime memory 
write/read checker that helps discover and locate memory access 
bugs. ASan is part of the official LDC release binaries; to use 
it you must build with -fsanitize=address. In this article, 
I’ll explain how to use ASan, what kind of bugs it can find, 
and what bugs it will be able to find in the (hopefully near) 
future."


Nice article. Main question / comment is about the need for 
blacklisting D standard libraries (druntime/phobos). If someone 
wants to try ASan out on their own code, can they start by 
ignoring the D standard libraries? And, for programs that use 
druntime/phobos, will this be effective? If I understand the 
post, the answer is "yes", but I think it could be more explicit.


Second comment is related - If the reader was to try 
instrumenting druntime/phobos along with their own code, how much 
effort should be expected to correctly blacklist druntime/phobos 
code? Would many programs have smooth sailing if they took the 
blacklist published in the post? Or is this early stage enough 
that some real effort should be expected?


Also, if the blacklist file in the post represents a meaningful 
starting point, perhaps it makes sense to check it in and 
distribute it. This would provide a place for contributors to 
start making improvements.






Re: Silicon Valley D Meetup - December 14, 2017 - "Experimenting with Link Time Optimization" by Jon Degenhardt

2017-12-20 Thread Jon Degenhardt via Digitalmars-d-announce
On Saturday, 16 December 2017 at 11:52:37 UTC, Johan Engelen 
wrote:

Clearly very interested in what your PGO testing will show. :-)


Early returns on adding PGO on top of LTO (first five benchmarks 
in the slide deck, tsv-join not tested):

* Two meaningful improvements:
  - csv2tsv: Linux: 8%; macOS: 33%
  - tsv-summarize: Linux: 6%; macOS: 11%
* Minor improvements on the other three benchmarks (< 5%)

Overall, for LDC 1.5, the improvements going from a normal 
optimized build to one combining LTO and PGO ranged from 8-45% on 
Linux, and 6-57% on macOS (first five benchmarks, excluding 
tsv-join). Impressive!


--Jon


Re: Silicon Valley D Meetup - December 14, 2017 - "Experimenting with Link Time Optimization" by Jon Degenhardt

2017-12-16 Thread Jon Degenhardt via Digitalmars-d-announce
On Saturday, 16 December 2017 at 11:52:37 UTC, Johan Engelen 
wrote:

On Friday, 15 December 2017 at 03:08:35 UTC, Ali Çehreli wrote:

This should be live now:

  http://youtu.be/e05QvoKy_8k


Great! I've added some comments there, pasted here:


Fantastic feedback! Fills in some really important details.

Can't wait to see the results of LTO on Weka.io's (LARGE) 
applications. Work in progress...!


Agreed. It'd be great to see the experience of a few more apps.

Could you add the reference links in the comment section there 
too? (can't click on blue links in the video ;-)


Done. Thanks for pointing this out. I also updated the posted 
slide deck so that the hyperlinks work after downloading it. 
(They still aren't clickable in the GitHub inline viewer.)



Clearly very interested in what your PGO testing will show. :-)


Yes, should be interesting. Promising results in one benchmark. 
And sigh, I forgot to mention the opportunity you mentioned for 
someone to participate: Adding LLVM's IR-level PGO to the LDC 
compiler. Sounds pretty cool.





Re: Silicon Valley D Meetup - December 14, 2017 - "Experimenting with Link Time Optimization" by Jon Degenhardt

2017-12-15 Thread Jon Degenhardt via Digitalmars-d-announce

On Friday, 15 December 2017 at 03:08:35 UTC, Ali Çehreli wrote:

This should be live now:

  http://youtu.be/e05QvoKy_8k

Ali

On 11/21/2017 11:58 AM, Ali Çehreli wrote:

Meetup page: 
https://www.meetup.com/D-Lang-Silicon-Valley/events/245288287/


LDC[1], the LLVM-based D compiler, has been adding Link Time 
Optimization capabilities over the last several releases. [...]


This talk will look at the results of applying LTO to one set 
of applications, eBay's TSV utilities[2]. [...]


Jon Degenhardt is a member of eBay's Search Science team.
[...] D quickly became his favorite programming language, one 
he uses whenever he can.


Ali

[1] 
https://github.com/ldc-developers/ldc#ldc--the-llvm-based-d-compiler


[2] 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/


Slides from the talk: 
https://github.com/eBay/tsv-utils-dlang/blob/master/docs/dlang-meetup-14dec2017.pdf




Re: What's the proper way to use std.getopt?

2017-12-12 Thread Jon Degenhardt via Digitalmars-d-learn
On Monday, 11 December 2017 at 20:58:25 UTC, Jordi Gutiérrez 
Hermoso wrote:
What's the proper style, then? Can someone show me a good 
example of how to use getopt and the docstring it automatically 
generates?


The command line tools I published use the approach described in 
a number of the replies, but with a tad more structure. It's 
hardly perfect, but may be useful if you want more examples. See: 
 
https://github.com/eBay/tsv-utils-dlang/blob/master/tsv-sample/src/tsv-sample.d. See the main() routine and the TsvSampleOptions struct. Most of the tools have a similar pattern.
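A hypothetical reduction of that pattern (the struct, option 
names, and help text below are invented for illustration, not 
taken from tsv-sample):

```d
import std.getopt;

struct Options
{
    string inputFile;
    size_t sampleSize = 10;

    // returns false if the program should exit (e.g. --help was given)
    bool processArgs(ref string[] args)
    {
        auto result = getopt(args,
            "f|file", "Input file (default: stdin).", &inputFile,
            "n|num",  "Number of lines to sample.", &sampleSize);
        if (result.helpWanted)
        {
            defaultGetoptPrinter("Synopsis: tool [options] [file]",
                                 result.options);
            return false;
        }
        return true;
    }
}

void main(string[] args)
{
    Options opts;
    if (!opts.processArgs(args))
        return;
    // ... run the tool using opts ...
}
```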


--Jon


Re: Thoughts about D

2017-11-29 Thread Jon Degenhardt via Digitalmars-d

On Wednesday, 29 November 2017 at 16:57:36 UTC, H. S. Teoh wrote:
While generally I would still use fullblown D rather than 
BetterC for my projects, the bloat from druntime/phobos does 
still bother me at the back of my mind.  IIRC, the Phobos docs 
used to state that the philosophy for Phobos is pay-as-you-go. 
As in, if you don't use feature X, the code and associated data 
that implements feature X shouldn't even appear in the 
executable. It seems that we have fallen away from that for a 
while now.  Perhaps it's time to move D back in that direction.


If there are specific apps where druntime and/or phobos bloat is 
thought to be too high, it might be worth trying the new LDC 
support for building a binary with druntime and phobos compiled 
with LTO (Link Time Optimization). I saw reduced binary sizes on 
my apps; it'd be interesting to hear other experiences.


Re: Thoughts about D

2017-11-26 Thread Jon Degenhardt via Digitalmars-d

On Monday, 27 November 2017 at 00:14:40 UTC, IM wrote:
I'm a full-time C++ software engineer in Silicon Valley. I've 
been learning D and using it in a couple of personal side 
projects for a few months now.


First of all, I must start by saying that I like D, and wish to 
use it everyday. I'm even considering to donate to the D 
foundation. However, some of D features and design decisions 
frustrates me a lot, and sometimes urges me to look for an 
alternative. I'm here not to criticize, but to channel my 
frustrations to whom it may concern. I want D to become better 
and more widely used. I'm sure many others might share with me 
some of the following points:


Forum discussions are valuable venue. Since you are in Silicon 
Valley, you might also consider attending one of the Silicon 
Valley D meetups (https://www.meetup.com/D-Lang-Silicon-Valley). 
It's hard to beat face-to-face conversations with other 
developers to get a variety of perspectives. The ultimate would 
be DConf, if you can manage to attend.

