On 15/09/15 12:30 AM, Fredrik Boulund wrote:
Hi,

This is my first post on Dlang forums and I don't have a lot of
experience with D (yet). I mainly code bioinformatics-stuff in Python on
my day-to-day job, but I've been toying with D for a couple of years
now. I had this idea that it'd be fun to write a parser for a text-based
tabular data format I tend to read a lot of in my programs, but I was a
bit stumped that the D implementation I created was slower than my
Python-version. I tried running `dmd -profile` on it but didn't really
understand what I can do to make it go faster. I guess there's some
unnecessary dynamic array extensions being made but I can't figure out
how to do without them, maybe someone can help me out? I tried making
the examples as small as possible.

Here's the D code: http://dpaste.com/2HP0ZVA
Here's my Python code for comparison: http://dpaste.com/0MPBK67

Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670
with RAID6 SAS disks and 192GB of RAM), the D version runs in about 20
seconds and the Python version in less than 16 seconds. I've repeated runs
at least thrice when testing. This holds true even if the D version is
compiled with -O.

The file being parsed is the output of a DNA/protein sequence mapping
algorithm called BLAT, but the tabular output format is originally known
from the famous BLAST algorithm.
Here's a short example of what the input files looks like:
http://dpaste.com/017N58F
The format is TAB-delimited: query, target, percent_identity,
alignment_length, mismatches, gaps, query_start, query_end,
target_start, target_end, e-value, bitscore
In the example the output is sorted by query, but this cannot be assumed
to hold true for all cases. The input files vary in size from several
hundred megabytes to several gigabytes (10+ GiB).

A brief explanation of what the code does (roughly sketched in D right below):
1) Parse each line,
2) Only accept records with percent_identity >= min_identity (90.0) and
alignment_length >= min_matches (10),
3) Store all such records as tuples (in the D code this is a struct) in an
array in an associative array indexed by 'query',
4) For each query, remove any records with percent_id more than 5
percentage points below the highest value observed for that query,
5) Write results to stdout (in my real code the data is subject to further
downstream processing)
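Condensed into D, those steps look roughly like this (just a quick sketch of the above, not my actual dpaste code; the struct fields and the file name are made up here):

import std.algorithm : filter, map, reduce, max, splitter;
import std.array : array;
import std.conv : to;
import std.stdio;

struct Hit { string target; double pid; int alnLen; }

void main()
{
    enum minIdentity = 90.0, minMatches = 10;
    Hit[][string] hits;                              // records grouped by query

    foreach (line; File("input.blast8").byLine)      // file name made up
    {
        auto fields = line.splitter('\t').array;
        auto pid = fields[2].to!double;
        auto alnLen = fields[3].to!int;
        if (pid < minIdentity || alnLen < minMatches)
            continue;
        hits[fields[0].idup] ~= Hit(fields[1].idup, pid, alnLen);
    }

    foreach (query, records; hits)
    {
        auto best = records.map!(h => h.pid).reduce!max;
        foreach (h; records.filter!(h => h.pid >= best - 5.0))
            writeln(query, '\t', h.target, '\t', h.pid);
    }
}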

This was all just for me learning to do some basic stuff in D, e.g. file
handling, streaming data from disk, etc. I'm really curious what I can
do to improve the D code. My original idea was that maybe I should
compile the performance critical parts of my Python codebase to D and
call them with PyD or something, but now I'm not so sure any more. Help
and suggestions appreciated!


A lot of this hasn't been covered yet, I believe.

http://dpaste.dzfl.pl/f7ab2915c3e1

1) You don't need to convert char[] to string via to!string. No. Too much. Cast it (see the sketch after this list).
2) You don't need byKey; use the foreach (key, value; aa) syntax. That way you won't go around modifying things unnecessarily.
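Roughly what I mean by both, as a sketch (the cast is only OK as long as you don't store the casted slice anywhere, since byLine reuses its buffer; the AA and file name are made up):

import std.stdio;

void main()
{
    double[string] bestPid;                        // stand-in for your AA

    foreach (line; File("input.blast8").byLine)    // file name made up
    {
        // 1) cast instead of to!string: no allocation, but the slice is only
        //    valid until the next iteration, so don't keep it around.
        string s = cast(string) line;
        // ... split/parse s here; .idup the query before putting it in the AA
    }

    // 2) foreach over key and value directly, instead of byKey + lookup.
    foreach (query, pid; bestPid)
        writeln(query, '\t', pid);
}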

Ok, I disabled the GC + reserved a bunch of memory. It probably won't help much actually; in fact it may make it fail, so keep that in mind.
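For reference, that part is just this (the reserve size is a guess at your working set):

import core.memory : GC;

void main()
{
    GC.disable();                      // no collections while churning through the file
    GC.reserve(1024 * 1024 * 1024);    // grab ~1 GiB from the OS up front
    scope(exit) GC.enable();

    // ... parsing as before ...
}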

Humm what else.

I'm worried about that first foreach. I don't think it needs to exist as it does; I believe an input range would be far better. Use a buffer to store the Hits, with a subset (slice) of it per query.
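Something along these lines (a simplified sketch of the input range idea only, without the shared Hit buffer; names, thresholds and the file name are placeholders):

import std.algorithm : splitter;
import std.array : array;
import std.conv : to;
import std.stdio;

struct Hit { string query; string target; double pid; int alnLen; }

// A lazy input range over the file: parses and filters one record at a time
// instead of building everything up front in that first foreach.
struct FilteredHits
{
    private typeof(File.init.byLine()) lines;
    private Hit current;
    private bool done;

    this(File f)
    {
        lines = f.byLine();
        popFront();                                  // prime the first record
    }

    @property bool empty() { return done; }
    @property Hit front() { return current; }

    void popFront()
    {
        while (!lines.empty)
        {
            auto fields = lines.front.splitter('\t').array;
            auto pid = fields[2].to!double;
            auto alnLen = fields[3].to!int;
            if (pid >= 90.0 && alnLen >= 10)         // min_identity / min_matches
            {
                current = Hit(fields[0].idup, fields[1].idup, pid, alnLen);
                lines.popFront();
                return;
            }
            lines.popFront();
        }
        done = true;
    }
}

void main()
{
    foreach (hit; FilteredHits(File("input.blast8")))   // file name made up
        writeln(hit.query, '\t', hit.pid);
}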

If the first foreach is an input range, then things become slightly easier in the second. Now you can turn that into its own input range too. Also, that .array usage concerns me. Many an allocation there! Hence why an input range should be what it returns.
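Per query that would be something like this instead of filtering into a fresh array (field names assumed):

import std.algorithm : filter, map, reduce, max;

struct Hit { string target; double pid; }

// Lazily keep only the hits within 5 percentage points of the best one,
// without allocating a new array per query.
auto keepNearBest(Hit[] hitlist)
{
    auto best = hitlist.map!(h => h.pid).reduce!max;   // assumes hitlist is non-empty
    return hitlist.filter!(h => h.pid >= best - 5.0);
}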

The last foreach is, let's assume, just a dummy. Keep in mind, writing to stdout is expensive here. DO NOT do it per record. If you must write output, buffer it and do it in large quantities.
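For example, append everything to one big buffer and only touch stdout every megabyte or so (the sizes and the dummy records are arbitrary):

import std.array : appender;
import std.format : formattedWrite;
import std.stdio;

void main()
{
    auto buf = appender!(char[])();

    foreach (i; 0 .. 1_000_000)              // stand-in for the real records
    {
        buf.formattedWrite("%d\t%s\n", i, "some-hit");
        if (buf.data.length > 1 << 20)       // flush roughly every MiB
        {
            stdout.rawWrite(buf.data);
            buf.clear();
        }
    }
    stdout.rawWrite(buf.data);               // flush whatever is left
}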


Based upon what I can see, you are definitely not using your CPUs to the max. There is no way they are the limiting factor here. Maybe your usage of a single core is, but not the CPUs themselves.

The thing is, you cannot use multiple threads on that first foreach loop to speed things up. No. That needs to happen all on one thread. Instead, that thread needs to push its results into the others.

Perhaps one lock (mutex) + buffer of hits per thread. Go round robin over all the threads. If a mutex is still locked, you'll need to wait; in this situation a locked mutex means all your worker threads are busy, so you can't do anything more (anyway).
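I'm not going to write the mutex + round-robin version out properly at this hour, but as a stand-in, the same shape with std.concurrency message passing looks roughly like this (worker count, batch size and the dummy records are all made up):

import std.concurrency;

struct Hit { string query; string target; double pid; }

// Worker: receives batches of hits, does the downstream processing,
// and stops when it gets an empty batch.
void worker()
{
    for (bool running = true; running; )
    {
        receive((immutable(Hit)[] batch) {
            if (batch.length == 0) { running = false; return; }
            // ... per-query filtering / output building goes here ...
        });
    }
}

void main()
{
    enum nWorkers = 4;
    Tid[nWorkers] workers;
    foreach (ref w; workers)
        w = spawn(&worker);

    immutable(Hit)[] batch;
    size_t next;
    foreach (i; 0 .. 100_000)                       // stand-in for the parsed records
    {
        batch ~= Hit("q", "t", 99.0);
        if (batch.length == 4096)                   // hand off in big batches
        {
            workers[next].send(batch);              // round robin over the workers
            next = (next + 1) % workers.length;
            batch = null;
        }
    }
    if (batch.length)
        workers[next].send(batch);                  // leftover partial batch
    foreach (w; workers)
        w.send((immutable(Hit)[]).init);            // tell everyone to stop
}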

Of course after all this, the HDD may still be getting hit too hard, in which case I would recommend memory mapping the file. That should allow the OS to handle reading it into memory more efficiently, but you'll need to rework the .byLine usage for that.
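With std.mmfile that would look something like this (the splitter takes over what .byLine was doing; file name made up):

import std.algorithm : splitter;
import std.mmfile;

void main()
{
    auto mmf = new MmFile("input.blast8");         // file name made up
    auto text = cast(const(char)[]) mmf[];         // whole file as one big slice

    foreach (line; text.splitter('\n'))            // replaces File.byLine
    {
        if (line.length == 0) continue;
        // parse the line as before; slices point straight into the mapping
    }
}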


Wow that was a lot at 4:30am! So don't take it too seriously. I'm sure somebody else will rip that to shreds!
