On Wednesday, 6 February 2013 at 11:15:22 UTC, monarch_dodra wrote:
I'm going to try and see with some example files if I can't get something running faster.

After some benchmarking and tweaking, I found three things that speed up your program:

1) Make computeLocal a compile-time constant. This gives you a tiny bit of performance. Whether it is worth it depends on whether you plan to make it a run-time argument switch, I guess.
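
As a rough sketch of point 1 (only the name computeLocal comes from the program; its type, its meaning and the countBases helper are assumptions made up for illustration), an enum constant combined with static if lets the compiler drop the unused branch entirely:

```d
import std.stdio;

// Assumption: computeLocal is a simple boolean switch in the real program.
enum bool computeLocal = true;

// Hypothetical helper, for illustration only: counts the characters of
// seq, skipping 'N' when computeLocal is enabled.
size_t countBases(const(char)[] seq)
{
    size_t n;
    foreach (c; seq)
    {
        // static if is resolved at compile time, so the branch that is
        // not selected generates no code and costs nothing per character.
        static if (computeLocal)
        {
            if (c != 'N')
                ++n;
        }
        else
        {
            ++n;
        }
    }
    return n;
}

void main()
{
    writeln(countBases("ACGTN"));
}
```

If the flag later has to become a command-line switch, it would need to go back to being an ordinary run-time variable, and the static if would become a plain if.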

2) This makes things about 10%-20% faster:
Your "nucleic" and "amino" hash tables map a character to an index. However, given the range of the characters ('A' to 'Z'), you are better off using a flat array, where each index represents a character, e.g. A is index 0, B is index 1. This way, a lookup is a simple array indexing operation, as opposed to a hash table lookup.

You may even get a bigger bang for your buck by simply giving your "_stats" structure a plain "A is index 0, B is index 1" layout, and only "re-ordering" the data at the end, when you want to read it. (I haven't done this though.)
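
A minimal sketch of the flat-array idea from point 2. The alphabet order "ACGTN" and the names nucleicOrder, buildTable and countLetters are assumptions for illustration, not taken from the original program, and the input is assumed to contain only 'A' to 'Z':

```d
import std.stdio;

// Hypothetical alphabet order; the real "nucleic" table may differ.
immutable string nucleicOrder = "ACGTN";

// Builds a 26-entry table indexed by (c - 'A').
// Entries for letters that are not in `order` stay at -1.
byte[26] buildTable(string order)
{
    byte[26] table = -1;
    foreach (i, c; order)
        table[c - 'A'] = cast(byte) i;
    return table;
}

// Counts how often each letter of nucleicOrder occurs in seq.
// Assumes seq holds only uppercase 'A'..'Z'.
size_t[5] countLetters(const(char)[] seq)
{
    immutable table = buildTable(nucleicOrder);
    size_t[5] counts;
    foreach (c; seq)
    {
        // Plain array indexing replaces the hash table lookup.
        immutable idx = table[c - 'A'];
        if (idx >= 0)
            ++counts[cast(size_t) idx];
    }
    return counts;
}

void main()
{
    writeln(countLetters("GATTACA")); // prints [3, 1, 1, 2, 0]
}
```

The variant suggested above for "_stats" would go one step further: count straight into a size_t[26] indexed by (c - 'A') and apply the re-ordering only once, when printing.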

3) This makes things about 100% faster (it ran in half the time on my machine): I don't know how mmFile works, but a simple File + "rawRead" seems to get the job done fast. Also, instead of keeping track of one (or several) indexes, I merely keep a single slice. The only thing I care about is whether my slice is empty, in which case I refill it.
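
The File + rawRead pattern with a single refilled slice looks roughly like this. This is not the code from the paste: the 64 KiB chunk size and the newline-counting task are placeholders chosen to keep the sketch short, and countNewlines is a made-up name.

```d
import std.stdio;

// Counts newline bytes in a file, reading it chunk by chunk.
size_t countNewlines(string path)
{
    auto file = File(path, "rb");
    auto buffer = new ubyte[](64 * 1024); // chunk size is an arbitrary choice
    ubyte[] slice;                        // unread part of the current chunk
    size_t lines;

    for (;;)
    {
        if (slice.length == 0)
        {
            // rawRead returns the part of buffer that was actually filled;
            // an empty result means the end of the file was reached.
            slice = file.rawRead(buffer);
            if (slice.length == 0)
                break;
        }
        if (slice[0] == '\n')
            ++lines;
        slice = slice[1 .. $]; // consume one byte, no separate index needed
    }
    return lines;
}

void main(string[] args)
{
    if (args.length > 1)
        writeln(countNewlines(args[1]));
}
```

The only state the loop carries is the slice itself, so "am I out of data?" reduces to checking slice.length.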

The modified code is here. I'm apparently getting the same output you are, but that doesn't mean there are no bugs in it. For example, I noticed that you don't strip leading whitespace, if any, before the first read.
http://dpaste.dzfl.pl/9b9353b8

----
I'd be tempted to rewrite the parser using a "byLine" approach, since my quick reading about fastq seems to imply it is a line-based format. Or just plain try to write a parser from scratch, putting my own logic and thought into it (all I did was modify your code, without caring about the actual algorithm).
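
To give an idea of what a byLine version could look like, here is an untested-against-real-data sketch. It assumes strict four-line fastq records (header, sequence, '+', quality) with no wrapped sequence lines, which may not hold for every file, and totalBases is a made-up name:

```d
import std.stdio;

// Sums the lengths of all sequence lines in a fastq file.
// Assumption: every record is exactly four lines long.
size_t totalBases(File input)
{
    size_t total;
    size_t lineNo;
    // byLine strips the line terminator by default and reuses its buffer,
    // so `line` must not be stored across iterations without a .dup.
    foreach (line; input.byLine())
    {
        // The second line of each group of four holds the sequence.
        if (lineNo % 4 == 1)
            total += line.length;
        ++lineNo;
    }
    return total;
}

void main(string[] args)
{
    if (args.length > 1)
        writeln(totalBases(File(args[1], "r")));
}
```

Whether this beats the rawRead version is something only a benchmark on real files can tell.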
