One of the issues we've discussed briefly was the performance of the
blastxml library, which parses Blast XML output files.  These files are
typically large, and there is quite a bit of overhead in parsing
them, which poses a challenge for transalign.  blastxml uses Neil
Mitchell's 'tagsoup' library for the XML bit.  I experimented with
several options once upon a time, and found it to be the most efficient
alternative.

Anyway: one thing we thought of was to check whether it copies out
strings (using ByteString's 'copy'), or just stores things as slices of
the input.  The latter could conceivably mean retaining a lot of the
input in memory.
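
To make the distinction concrete, here is a minimal sketch using strict
ByteStrings.  The names 'nameSlice' and 'nameCopy' are made up for
illustration and are not part of blastxml.  A slice shares the buffer of
the input it was cut from, so keeping the slice alive keeps the whole
buffer alive; 'copy' allocates a fresh buffer of just the needed size.

```haskell
import qualified Data.ByteString.Char8 as B

-- | First word of a line, as a slice.  The result shares the
-- underlying buffer of the input, so the input cannot be freed
-- while the result is still referenced.
nameSlice :: B.ByteString -> B.ByteString
nameSlice = B.takeWhile (/= ' ')

-- | Same result, but copied into its own small buffer, so the
-- (possibly large) input can be garbage collected independently.
nameCopy :: B.ByteString -> B.ByteString
nameCopy = B.copy . nameSlice

main :: IO ()
main = do
  let input = B.pack "query_1 a long description we do not need"
  B.putStrLn (nameSlice input)
  B.putStrLn (nameCopy input)
```

Both functions return the same string; they differ only in what they
keep alive on the heap.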

Well, I made a simple program to just extract the query names, and print
the first and last one.  And the sad? news is that whether I insert the
copy operator in blastxml or not, it doesn't make any noticeable
difference.

Some other observations:

- blastxml allocates a lot of memory.  In my test, I parse about a million
  lines of XML, and it allocates about 50 gigs.
- perhaps more gravely, even though I only want to keep the names of the
  query sequences - one hundred strings - it copies most of that, about
  43 gigs, during GC.  Which, I think, means it is retained for long
  enough that it is moved into the old generation.
- consequently, the program spends most of its time GCing; I got only
  16% productivity.
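
Figures of this kind (bytes allocated, bytes copied during GC,
productivity) are what the GHC runtime prints with '+RTS -s', and they
can also be read from inside a program via GHC.Stats.  A rough sketch,
assuming the program is run with '+RTS -T' (or '-s') so that statistics
are collected; the 'productivity' helper is my own approximation
(mutator time over mutator plus GC time), not the exact formula the
runtime uses:

```haskell
import Data.Int (Int64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- | Fraction of CPU time spent running the program itself rather
-- than collecting garbage.  Arguments are nanoseconds.
productivity :: Int64 -> Int64 -> Double
productivity mutatorNs gcNs
  | total == 0 = 0
  | otherwise  = fromIntegral mutatorNs / fromIntegral total
  where
    total = mutatorNs + gcNs

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "no statistics; run with +RTS -T"
    else do
      s <- getRTSStats
      putStrLn ("bytes allocated:  " ++ show (allocated_bytes s))
      putStrLn ("bytes copied:     " ++ show (copied_bytes s))
      putStrLn ("max live bytes:   " ++ show (max_live_bytes s))
      putStrLn ("productivity:     "
                ++ show (productivity (mutator_cpu_ns s) (gc_cpu_ns s)))
```

A large "bytes copied" figure relative to the final live data suggests
that a lot of data survives at least one collection before it dies.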

Not sure how to remedy this, but I thought I'd share in case anybody has
any ideas.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
