Re: Faster Command Line Tools in D

Patrick Schluter via Digitalmars-d-announce Tue, 30 May 2017 22:11:56 -0700

On Tuesday, 30 May 2017 at 22:31:50 UTC, Steven Schveighoffer wrote:

On 5/30/17 5:57 PM, Patrick Schluter wrote:
On Tuesday, 30 May 2017 at 21:18:42 UTC, Steven Schveighoffer wrote:
On 5/26/17 11:20 AM, John Colvin wrote:
On Friday, 26 May 2017 at 14:41:39 UTC, John Colvin wrote:
[...]
This version also has the advantage of being (discounting any bugs in iopipe) correct for arbitrary Unicode in all common UTF encodings.
I worked a lot on making sure this works properly. However, it's
possible that there are some lingering issues.
I also did not spend much time optimizing these paths (whereas I spent a ton of time getting the utf8 line parsing as fast as it could be). Partly because finding things other than utf8 in the wild is rare, and partly because I have nothing to compare it with to know what is
possible :)
If you want UCS-2 (aka UTF-16 without surrogates) data I can give you
gigabytes of files in tmx format.
The data I can (and have) generated from UTF-8 data. I have tested my byLine parser to make sure it properly splits on "interesting" code points in all widths. UTF-16 data without surrogates should probably work fine. I haven't tuned it though like I tuned the UTF-8 version. Is there a memchr for wide characters? ;)
What I really haven't done is compared my line parsing code with multi-code-unit delimiters against one that can do the same thing. I know Phobos and C FILE * really can't do it. I haven't really looked at all in C++, so I should probably look there before giving up.
-Steve
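
On the memchr question: standard C does have wmemchr in <wchar.h>, but it searches wchar_t, which is 32 bits on most Unix-like systems and 16 bits on Windows, so it is not a portable answer for UTF-16 data. Below is a minimal sketch of a plain code-unit scan, written in C because memchr is the comparison point. The names find_unit16 and count_lines16 are hypothetical, and this is not how iopipe or Phobos do it; it only shows what a 16-bit counterpart of memchr could look like, assuming the buffer has already been read into native-endian uint16_t units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the first code unit equal to `needle`
   in units[0..n), or n if it does not occur. Plays the role memchr
   plays for bytes, but over 16-bit code units. */
static size_t find_unit16(const uint16_t *units, size_t n, uint16_t needle)
{
    assert(units != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (units[i] == needle)
            return i;
    }
    return n;
}

/* Hypothetical helper: count lines delimited by U+000A. A trailing
   line without a final delimiter is counted as well. */
static size_t count_lines16(const uint16_t *units, size_t n)
{
    size_t lines = 0;
    size_t pos = 0;
    while (pos < n) {
        size_t hit = pos + find_unit16(units + pos, n - pos, 0x000A);
        ++lines;
        pos = hit + 1; /* step past the delimiter, or past the end */
    }
    return lines;
}
```

In UTF-16 without surrogates a delimiter such as U+2028 is still a single code unit, so the same scan covers it; in UTF-8 that code point is three bytes, which is where multi-code-unit delimiters come in.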

In any case, you can download the dataset from [1] if you like. There are several 100 MB zip files containing a collection of tmx files (translation memory exchange) with European legislation. The files contain multi-alignment texts in up to 24 languages. The files are encoded in UCS-2 little-endian. I know for a fact (because I compiled the data) that they don't contain characters outside of the BMP. The data is public and can be used freely (as in beer). When I get some time, I will try to port the Java app that is distributed with it to D (partially done already).

[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
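
For anyone who wants to feed these files to a UTF-8 tool, here is a rough sketch of the conversion, in C and under the assumptions stated above: input is UCS-2 little-endian with nothing outside the BMP. The function name ucs2le_to_utf8 is made up for illustration and is not part of the distributed Java app or of any library. A surrogate code unit is treated as an error rather than decoded, and a byte order mark, if the input starts with one, is converted like any other code unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert UCS-2 little-endian bytes to UTF-8.
   src_len is a byte count and must be even. dst needs room for
   (src_len / 2) * 3 bytes, the worst case for BMP code points.
   Returns the number of bytes written, or (size_t)-1 if a surrogate
   code unit (0xD800..0xDFFF) shows up, since that is not UCS-2. */
static size_t ucs2le_to_utf8(const unsigned char *src, size_t src_len,
                             unsigned char *dst)
{
    assert(src_len % 2 == 0);
    size_t out = 0;
    for (size_t i = 0; i < src_len; i += 2) {
        /* Little-endian: low byte first, then high byte. */
        uint16_t u = (uint16_t)(src[i] | (src[i + 1] << 8));
        if (u >= 0xD800 && u <= 0xDFFF)
            return (size_t)-1;
        if (u < 0x80) {
            dst[out++] = (unsigned char)u;                  /* 1 byte  */
        } else if (u < 0x800) {
            dst[out++] = (unsigned char)(0xC0 | (u >> 6));  /* 2 bytes */
            dst[out++] = (unsigned char)(0x80 | (u & 0x3F));
        } else {
            dst[out++] = (unsigned char)(0xE0 | (u >> 12)); /* 3 bytes */
            dst[out++] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

Because every BMP code point needs at most three UTF-8 bytes, the output buffer can be sized up front from the input length and the loop never has to reallocate.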
