One of the issues we've discussed briefly was the performance of the blastxml library, which parses Blast XML output files. These files are typically large, and there is quite a bit of overhead in parsing them, which poses a challenge for transalign. blastxml uses Neil Mitchell's 'tagsoup' library for the XML bit. I experimented with several options once upon a time, and found it to be the most efficient alternative.
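For anyone who hasn't looked at tagsoup, here is roughly what pulling query names out of a tag stream looks like. This is only a minimal sketch and not the actual blastxml code: the function name queryNames is made up for illustration, and I'm assuming the query name lives in the Iteration_query-def element. The B.copy call marks the spot where a name can be detached from the input buffer instead of being kept as a slice of it.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as B
import Text.HTML.TagSoup (Tag (..), parseTags)

-- Hypothetical helper (not part of blastxml): collect the text that
-- directly follows each <Iteration_query-def> opening tag.
queryNames :: B.ByteString -> [B.ByteString]
queryNames = go . parseTags
  where
    -- B.copy allocates a fresh buffer for the name, so the result does
    -- not hold a reference to the (large) input string. Without it, the
    -- name may be a slice that shares the input's buffer.
    go (TagOpen "Iteration_query-def" _ : TagText t : rest) = B.copy t : go rest
    go (_ : rest) = go rest
    go [] = []

main :: IO ()
main = do
  let xml = "<Iteration><Iteration_query-def>seq1 test</Iteration_query-def></Iteration>" :: B.ByteString
  -- Print every query name found in the small example document.
  mapM_ B.putStrLn (queryNames xml)
```

Dropping the B.copy gives the slice-retaining variant, so the two behaviours can be compared by changing a single line.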
Anyway: one thing we thought of was to check whether it copies out strings (using ByteString's 'copy'), or just stores things as slices of the input. The latter could conceivably mean retaining a lot of the input in memory.

Well, I made a simple program to just extract the query names, and print the first and last one. And the sad? news is that whether I insert the copy operation in blastxml or not, it doesn't make any noticeable difference.

Some other observations:

- blastxml allocates a lot of memory. In my test, I parse about a million lines of XML, and it allocates about 50 gigs.
- perhaps more gravely, even though I only want to keep the names of the query sequences (one hundred strings), it copies most of that, about 43 gigs, during GC. Which, I think, means the data is retained for long enough that it is moved into the long-term pool.
- consequently, the program spends most of its time GCing, and I got only 16% productivity.

Not sure how to remedy this, but I thought I'd share in case anybody has any ideas.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants