On 15/09/15 12:30 AM, Fredrik Boulund wrote:
Hi,

This is my first post on Dlang forums and I don't have a lot of
experience with D (yet). I mainly code bioinformatics-stuff in Python on
my day-to-day job, but I've been toying with D for a couple of years
now. I had this idea that it'd be fun to write a parser for a text-based
tabular data format I tend to read a lot of in my programs, but I was a
bit stumped that the D implementation I created was slower than my
Python-version. I tried running `dmd -profile` on it but didn't really
understand what I can do to make it go faster. I guess there's some
unnecessary dynamic array extensions being made but I can't figure out
how to do without them, maybe someone can help me out? I tried making
the examples as small as possible.

Here's the D code: http://dpaste.com/2HP0ZVA
Here's my Python code for comparison: http://dpaste.com/0MPBK67

Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670
with RAID6 SAS disks and 192GB of RAM), the D version runs in about 20
seconds and the Python version in less than 16 seconds. I've repeated runs
at least thrice when testing. This holds true even if the D version is
compiled with -O.

The file being parsed is the output of a DNA/protein sequence mapping
algorithm called BLAT, but the tabular output format is originally known
from the famous BLAST algorithm.
Here's a short example of what the input files looks like:
http://dpaste.com/017N58F
The format is TAB-delimited: query, target, percent_identity,
alignment_length, mismatches, gaps, query_start, query_end,
target_start, target_end, e-value, bitscore
In the example the output is sorted by query, but this cannot be assumed
to hold true for all cases. The input files vary in size from several
hundred megabytes to several gigabytes (10+ GiB).

A brief explanation of what the code does (roughly sketched in D right below):
1) Parse each line,
2) Only accept records with percent_identity >= min_identity (90.0) and
alignment_length >= min_matches (10),
3) Store all such records as tuples (in the D code this is a struct) in an
array in an associative array indexed by 'query',
4) For each query, remove any records with percent_id more than 5
percentage points below the highest value observed for that query,
5) Write results to stdout (in my real code the data is subject to further
downstream processing)
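Condensed into D, those steps look roughly like this (just a quick sketch of the above, not my actual dpaste code; the struct fields and the file name are made up here):

import std.algorithm : filter, map, reduce, max, splitter;
import std.array : array;
import std.conv : to;
import std.stdio;

struct Hit { string target; double pid; int alnLen; }

void main()
{
    enum minIdentity = 90.0, minMatches = 10;
    Hit[][string] hits;                              // records grouped by query

    foreach (line; File("input.blast8").byLine)      // file name made up
    {
        auto fields = line.splitter('\t').array;
        auto pid = fields[2].to!double;
        auto alnLen = fields[3].to!int;
        if (pid < minIdentity || alnLen < minMatches)
            continue;
        hits[fields[0].idup] ~= Hit(fields[1].idup, pid, alnLen);
    }

    foreach (query, records; hits)
    {
        auto best = records.map!(h => h.pid).reduce!max;
        foreach (h; records.filter!(h => h.pid >= best - 5.0))
            writeln(query, '\t', h.target, '\t', h.pid);
    }
}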

This was all just for me learning to do some basic stuff in D, e.g. file
handling, streaming data from disk, etc. I'm really curious what I can
do to improve the D code. My original idea was that maybe I should
compile the performance critical parts of my Python codebase to D and
call them with PyD or something, but now I'm not so sure any more. Help
and suggestions appreciated!


A lot of this hasn't been covered yet, I believe.

http://dpaste.dzfl.pl/f7ab2915c3e1

1) You don't need to convert char[] to string via to!string. No. Too much. Cast it (see the sketch after this list).
2) You don't need byKey; use the foreach (key, value; aa) syntax. That way you won't go around modifying things unnecessarily.
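Roughly what I mean by both, as a sketch (the cast is only OK as long as you don't store the casted slice anywhere, since byLine reuses its buffer; the AA and file name are made up):

import std.stdio;

void main()
{
    double[string] bestPid;                        // stand-in for your AA

    foreach (line; File("input.blast8").byLine)    // file name made up
    {
        // 1) cast instead of to!string: no allocation, but the slice is only
        //    valid until the next iteration, so don't keep it around.
        string s = cast(string) line;
        // ... split/parse s here; .idup the query before putting it in the AA
    }

    // 2) foreach over key and value directly, instead of byKey + lookup.
    foreach (query, pid; bestPid)
        writeln(query, '\t', pid);
}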

Ok, I disabled the GC + reserved a bunch of memory. It probably won't help much actually; in fact it may make it fail, so keep that in mind.
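For reference, that part is just this (the reserve size is a guess at your working set):

import core.memory : GC;

void main()
{
    GC.disable();                      // no collections while churning through the file
    GC.reserve(1024 * 1024 * 1024);    // grab ~1 GiB from the OS up front
    scope(exit) GC.enable();

    // ... parsing as before ...
}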

Humm what else.

I'm worried about that first foreach. I don't think it needs to exist as it does; I believe an input range would be far better. Use a buffer to store the Hits, with a subset (slice) of it per query.
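Something along these lines (a simplified sketch of the input range idea only, without the shared Hit buffer; names, thresholds and the file name are placeholders):

import std.algorithm : splitter;
import std.array : array;
import std.conv : to;
import std.stdio;

struct Hit { string query; string target; double pid; int alnLen; }

// A lazy input range over the file: parses and filters one record at a time
// instead of building everything up front in that first foreach.
struct FilteredHits
{
    private typeof(File.init.byLine()) lines;
    private Hit current;
    private bool done;

    this(File f)
    {
        lines = f.byLine();
        popFront();                                  // prime the first record
    }

    @property bool empty() { return done; }
    @property Hit front() { return current; }

    void popFront()
    {
        while (!lines.empty)
        {
            auto fields = lines.front.splitter('\t').array;
            auto pid = fields[2].to!double;
            auto alnLen = fields[3].to!int;
            if (pid >= 90.0 && alnLen >= 10)         // min_identity / min_matches
            {
                current = Hit(fields[0].idup, fields[1].idup, pid, alnLen);
                lines.popFront();
                return;
            }
            lines.popFront();
        }
        done = true;
    }
}

void main()
{
    foreach (hit; FilteredHits(File("input.blast8")))   // file name made up
        writeln(hit.query, '\t', hit.pid);
}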

If the first foreach is an input range, then things become slightly easier in the second. Now you can turn that into its own input range too. Also, that .array usage concerns me. Many an allocation there! Hence why an input range should be what it returns.
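Per query that would be something like this instead of filtering into a fresh array (field names assumed):

import std.algorithm : filter, map, reduce, max;

struct Hit { string target; double pid; }

// Lazily keep only the hits within 5 percentage points of the best one,
// without allocating a new array per query.
auto keepNearBest(Hit[] hitlist)
{
    auto best = hitlist.map!(h => h.pid).reduce!max;   // assumes hitlist is non-empty
    return hitlist.filter!(h => h.pid >= best - 5.0);
}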

The last foreach is, let's assume, just a dummy. Keep in mind, writing to stdout is expensive here. DO NOT do it per record. If you must write output, buffer it and do it in large quantities.
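For example, append everything to one big buffer and only touch stdout every megabyte or so (the sizes and the dummy records are arbitrary):

import std.array : appender;
import std.format : formattedWrite;
import std.stdio;

void main()
{
    auto buf = appender!(char[])();

    foreach (i; 0 .. 1_000_000)              // stand-in for the real records
    {
        buf.formattedWrite("%d\t%s\n", i, "some-hit");
        if (buf.data.length > 1 << 20)       // flush roughly every MiB
        {
            stdout.rawWrite(buf.data);
            buf.clear();
        }
    }
    stdout.rawWrite(buf.data);               // flush whatever is left
}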


Based upon what I can see, you are definitely not using your CPUs to the max. There is no way they are the limiting factor here. Maybe your usage of a single core is, but not the CPUs themselves.

The thing is, you cannot use multiple threads on that first foreach loop to speed things up. No. That needs to happen all on one thread. Instead, that thread needs to push its results into the others.

Perhaps one lock (mutex) + buffer of hits per thread. Go round robin over all the threads. If a mutex is still locked, you'll need to wait; in this situation a locked mutex means all your worker threads are busy, so you can't do anything more (anyway).
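I'm not going to write the mutex + round-robin version out properly at this hour, but as a stand-in, the same shape with std.concurrency message passing looks roughly like this (worker count, batch size and the dummy records are all made up):

import std.concurrency;

struct Hit { string query; string target; double pid; }

// Worker: receives batches of hits, does the downstream processing,
// and stops when it gets an empty batch.
void worker()
{
    for (bool running = true; running; )
    {
        receive((immutable(Hit)[] batch) {
            if (batch.length == 0) { running = false; return; }
            // ... per-query filtering / output building goes here ...
        });
    }
}

void main()
{
    enum nWorkers = 4;
    Tid[nWorkers] workers;
    foreach (ref w; workers)
        w = spawn(&worker);

    immutable(Hit)[] batch;
    size_t next;
    foreach (i; 0 .. 100_000)                       // stand-in for the parsed records
    {
        batch ~= Hit("q", "t", 99.0);
        if (batch.length == 4096)                   // hand off in big batches
        {
            workers[next].send(batch);              // round robin over the workers
            next = (next + 1) % workers.length;
            batch = null;
        }
    }
    if (batch.length)
        workers[next].send(batch);                  // leftover partial batch
    foreach (w; workers)
        w.send((immutable(Hit)[]).init);            // tell everyone to stop
}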

Of course after all this, the HDD may still be getting hit too hard, in which case I would recommend memory mapping the file. That should allow the OS to handle reading it into memory more efficiently, but you'll need to rework the .byLine usage for that.
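With std.mmfile that would look something like this (the splitter takes over what .byLine was doing; file name made up):

import std.algorithm : splitter;
import std.mmfile;

void main()
{
    auto mmf = new MmFile("input.blast8");         // file name made up
    auto text = cast(const(char)[]) mmf[];         // whole file as one big slice

    foreach (line; text.splitter('\n'))            // replaces File.byLine
    {
        if (line.length == 0) continue;
        // parse the line as before; slices point straight into the mapping
    }
}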


Wow that was a lot at 4:30am! So don't take it too seriously. I'm sure somebody else will rip that to shreds!
