On Monday, 14 September 2015 at 13:55:50 UTC, Fredrik Boulund wrote:
On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen wrote:
Two things that you could try:

First, hitlists.byKey can be expensive (especially if hitlists is big). Instead, use:

foreach( key, value ; hitlists )

Also, filter.array.length is quite expensive. You could use count instead:
import std.algorithm : count;
value.count!(h => h.pid >= (max_pid - max_pid_diff));
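A minimal sketch of the two suggestions above combined, for readers who want something compilable. The Hit struct, its target field, the countGoodHits helper and the sample values are assumptions made for illustration; only hitlists, pid, max_pid and max_pid_diff come from the thread, and max_pid is treated here as a single value rather than a per-query maximum.

```d
import std.algorithm : count;

// Hypothetical hit record; only the `pid` field is taken from the thread.
struct Hit
{
    string target;
    double pid;
}

// For each query, count the hits whose pid lies within max_pid_diff of
// max_pid, without building an intermediate filtered array.
size_t[string] countGoodHits(Hit[][string] hitlists,
                             double max_pid, double max_pid_diff)
{
    size_t[string] counts;
    // Iterating key and value together avoids a separate lookup per key.
    foreach (key, value; hitlists)
        counts[key] = value.count!(h => h.pid >= (max_pid - max_pid_diff));
    return counts;
}

void main()
{
    Hit[][string] hitlists = [
        "q1": [Hit("a", 99.0), Hit("b", 90.0), Hit("c", 97.5)],
        "q2": [Hit("d", 80.0)]
    ];
    auto counts = countGoodHits(hitlists, 99.0, 2.0);
    assert(counts["q1"] == 2); // 99.0 and 97.5 pass the 97.0 threshold
    assert(counts["q2"] == 0);
}
```

count walks the range lazily, so nothing is allocated per query beyond the entry in the result table.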

I didn't know that hitlists.byKey was that expensive; that's just the kind of feedback I was hoping for. I'm just grasping at straws in the online documentation when I want to do things. With my Python background, it feels as if I can still get things working that way.

I started learning D maybe a couple of years ago. I found Ali's book, Andrei's book, GitHub source code (including Phobos), and asking here to be the best resources. The docs make perfect sense once you have reached a certain level (or perhaps if you have a computer-science background), but they can be tough before that (though they are getting better).

You should definitely take a look at the dlangscience project organized by John Colvin and others.

If you like IPython/Jupyter, also see his pydmagic, which lets you write D inline in a notebook.

You may find this series of posts interesting too - another bioinformatics guy migrating from Python:
http://forum.dlang.org/post/akzdstfiwwzfeoudh...@forum.dlang.org

I realize the filter.array.length thing is indeed expensive. I find it especially horrendous that the code I've written needs to allocate a big dynamic array that will most likely be cut down quite drastically in this step. Unfortunately I haven't figured out a good way to do this without storing the intermediary results since I cannot know if there might be yet another hit for any encountered "query" since the input file might not be sorted. But the main reason I didn't just count the values like you suggest is actually that I need the filtered hits in later downstream analysis. The filtered hits for each query are used as input to a lowest common ancestor algorithm on the taxonomic tree (of life).
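Since the filtered hits are needed downstream anyway, one option is to compact each hit list in place rather than allocating a second array. The sketch below uses std.algorithm's remove with a predicate; the Hit struct, the keepGoodHits helper and the sample values are hypothetical, and whether this fits the real program depends on code not shown in the thread.

```d
import std.algorithm : remove;

// Hypothetical hit record; only the `pid` field is taken from the thread.
struct Hit
{
    string target;
    double pid;
}

// Drop hits below the threshold in place. The returned slice reuses the
// memory of `hits`, so no second array is allocated; the original slice
// should not be used afterwards.
Hit[] keepGoodHits(Hit[] hits, double max_pid, double max_pid_diff)
{
    immutable threshold = max_pid - max_pid_diff;
    // The default strategy is stable, so surviving hits keep their order.
    return hits.remove!(h => h.pid < threshold);
}

void main()
{
    auto hits = [Hit("a", 99.0), Hit("b", 90.0), Hit("c", 97.5)];
    auto good = keepGoodHits(hits, 99.0, 2.0);
    assert(good.length == 2);
    assert(good[0].target == "a" && good[1].target == "c");
}
```

The length of the returned slice gives the count for free, and the slice itself can be passed on to the later analysis.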

Unfortunately, I haven't had time to read your code, and others will do better. But do you use .reserve()? Also, this is a nice, fast container library based on Andrei Alexandrescu's allocator:
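For reference, a small sketch of what reserving capacity looks like, both on a built-in dynamic array and on std.array's Appender. The element type and the size of 1_000 are arbitrary assumptions; a real program would reserve based on an estimate of how many hits it expects.

```d
import std.array : appender;

void main()
{
    // Built-in dynamic array: request capacity up front so that the
    // appends below do not have to reallocate as the array grows.
    int[] values;
    values.reserve(1_000);
    assert(values.capacity >= 1_000);
    foreach (i; 0 .. 1_000)
        values ~= i;
    assert(values.length == 1_000);

    // Appender supports the same idea and is often faster for heavy appending.
    auto app = appender!(int[])();
    app.reserve(1_000);
    foreach (i; 0 .. 1_000)
        app.put(i);
    assert(app.data.length == 1_000);
}
```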

https://github.com/economicmodeling/containers

