Re: Speeding up text file parser (BLAST tabular format)

Laeeth Isharc via Digitalmars-d-learn Mon, 14 Sep 2015 07:21:20 -0700

On Monday, 14 September 2015 at 13:55:50 UTC, Fredrik Boulund wrote:

On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen wrote:
Two things that you could try:
First, hitlists.byKey can be expensive (especially if hitlists is big). Instead use:
foreach( key, value ; hitlists )
Also the filter.array.length is quite expensive. You could use count instead.
import std.algorithm : count;
value.count!(h => h.pid >= (max_pid - max_pid_diff));
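
A minimal sketch of the two suggestions above, assuming hitlists is an associative array from query name to an array of hits. The Hit struct, the sample data, the helper name countNearBest and the way max_pid is computed are assumptions made for illustration; only pid, max_pid and max_pid_diff appear in the thread.

```d
import std.algorithm : count;
import std.stdio : writeln;

// Hypothetical hit record; only the `pid` field is mentioned in the thread.
struct Hit
{
    string target;
    double pid;
}

// Counts the hits whose pid lies within max_pid_diff of the best pid.
// Assumption: max_pid is the highest pid among this query's hits.
size_t countNearBest(const(Hit)[] hits, double max_pid_diff)
{
    double max_pid = 0;
    foreach (h; hits)
        if (h.pid > max_pid)
            max_pid = h.pid;

    // count walks the hits once and does not build an intermediate array.
    return hits.count!(h => h.pid >= (max_pid - max_pid_diff));
}

void main()
{
    // Hypothetical data: query name -> hits for that query.
    Hit[][string] hitlists = [
        "q1": [Hit("a", 99.0), Hit("b", 97.5), Hit("c", 80.0)],
        "q2": [Hit("d", 90.0)],
    ];

    // Iterate over keys and values together instead of going through byKey.
    foreach (key, value; hitlists)
        writeln(key, ": ", countNearBest(value, 5.0));
}
```
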
I didn't know that hitlists.byKey was that expensive, that's just the kind of feedback I was hoping for. I'm just grasping for straws in the online documentation when I want to do things. With my Python background it feels as if I can still get things that work that way.

I picked up D to start learning maybe a couple of years ago. I found Ali's book, Andrei's book, github source code (including for Phobos), and asking here to be the best resources. The docs make perfect sense when you have got to a certain level (or perhaps if you have a computer sciencey background), but can be tough before that (though they are getting better).

You should definitely take a look at the dlangscience project organized by John Colvin and others.

If you like ipython/jupyter also see his pydmagic - write D inline in a notebook.

You may find this series of posts interesting too - another bioinformatics guy migrating from Python:

http://forum.dlang.org/post/[email protected]

I realize the filter.array.length thing is indeed expensive. I find it especially horrendous that the code I've written needs to allocate a big dynamic array that will most likely be cut down quite drastically in this step. Unfortunately I haven't figured out a good way to do this without storing the intermediary results since I cannot know if there might be yet another hit for any encountered "query" since the input file might not be sorted. But the main reason I didn't just count the values like you suggest is actually that I need the filtered hits in later downstream analysis. The filtered hits for each query are used as input to a lowest common ancestor algorithm on the taxonomic tree (of life).
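
One possible way to keep the filtered hits without allocating a second array is to filter in place with std.algorithm's remove. This is only a sketch and not the code from the original program: the Hit struct and the helper name keepNearBest are hypothetical, and it assumes max_pid has already been computed for the query.

```d
import std.algorithm : remove;

// Hypothetical hit record; only the `pid` field is mentioned in the thread.
struct Hit
{
    string target;
    double pid;
}

// Drops every hit below the cutoff and returns the shortened slice.
// The surviving hits are moved to the front of the same memory block,
// so no new array is allocated. The default strategy is stable, so the
// relative order of the surviving hits is preserved.
Hit[] keepNearBest(Hit[] hits, double max_pid, double max_pid_diff)
{
    immutable cutoff = max_pid - max_pid_diff;
    return hits.remove!(h => h.pid < cutoff);
}

void main()
{
    auto hits = [Hit("a", 99.0), Hit("b", 80.0), Hit("c", 97.5)];
    hits = keepNearBest(hits, 99.0, 5.0);
    assert(hits.length == 2);
}
```

Inside a loop such as foreach (key, ref value; hitlists), the result could be assigned back with value = keepNearBest(value, max_pid, max_pid_diff), so that the downstream analysis sees only the filtered hits.
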

Unfortunately I haven't time to read your code, and others will do better. But do you use .reserve()? Also, this is a nice, fast container library based on Andrei Alexandrescu's allocator:


https://github.com/economicmodeling/containers
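
On the .reserve() question above, a minimal sketch of what pre-allocating could look like, using the built-in reserve for dynamic arrays and, as an alternative, std.array's appender. The Hit struct, the function names and the expected size are made up for illustration; in a real parser a sensible size estimate would have to come from the input.

```d
import std.array : appender;

// Hypothetical hit record; only the `pid` field is mentioned in the thread.
struct Hit
{
    string target;
    double pid;
}

// Built-in reserve: asks the runtime for capacity up front so that
// repeated ~= appends are less likely to reallocate.
Hit[] collectWithReserve(size_t expected)
{
    Hit[] hits;
    hits.reserve(expected);
    foreach (i; 0 .. expected)
        hits ~= Hit("t", 90.0 + i);
    return hits;
}

// Appender keeps track of its own capacity and also offers reserve.
Hit[] collectWithAppender(size_t expected)
{
    auto app = appender!(Hit[])();
    app.reserve(expected);
    foreach (i; 0 .. expected)
        app.put(Hit("t", 90.0 + i));
    return app.data;
}

void main()
{
    assert(collectWithReserve(1000).length == 1000);
    assert(collectWithAppender(1000).length == 1000);
}
```
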
