Olaf writes:

> your program looks almost exactly how I'd write it, except for the
> foldl' Christian mentioned.

Nice to hear! It is very simple, as you say, so maybe that's also why
I'm not that far off.

> I also doubt that the Haskell program can really outperform a
> well-written C program on such a simple task.

I agree. But the C-program I am taking on, as it were, is not really
well-written. For one thing, it does malloc()/free() for every line.
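To illustrate the alternative, here is a minimal sketch of reading lines while reusing a single buffer, assuming a POSIX system with getline(). The function name total_line_length is hypothetical and this is not the code of the fastqstats program being measured:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: sum the lengths of all lines in `in`, not
 * counting the trailing newline. getline() grows one buffer as
 * needed and reuses it, so there is no malloc()/free() per line.
 * Returns -1 if a read error occurred. */
long long total_line_length(FILE *in)
{
    char *buf = NULL;   /* allocated and resized by getline() */
    size_t cap = 0;     /* current capacity of buf */
    ssize_t n;
    long long total = 0;

    while ((n = getline(&buf, &cap, in)) != -1) {
        if (n > 0 && buf[n - 1] == '\n')
            n--;        /* do not count the newline */
        total += n;
    }
    free(buf);          /* a single free for the whole file */
    return ferror(in) ? -1 : total;
}
```

Calling it as total_line_length(stdin) would process standard input with one allocation that only grows when a longer line turns up.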

(Oh, and it doesn't handle big numbers, it overflows without detecting
it :-))
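The negative totals in the C output below are consistent with the sum being kept in a 32-bit signed integer, though that is my assumption about the program, not something I have checked in its source. A small sketch of the wrap-around and of one way to detect overflow (the function names are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Show what a 64-bit total looks like if only the low 32 bits are
 * kept and read as a signed value. Signed overflow is undefined
 * behaviour in C, so this goes through unsigned arithmetic, which
 * mirrors the wrap typically seen on two's-complement machines. */
int32_t wrap_to_int32(int64_t total)
{
    uint32_t low = (uint32_t)(uint64_t)total;   /* low 32 bits */
    if (low <= (uint32_t)INT32_MAX)
        return (int32_t)low;
    return (int32_t)((int64_t)low - 4294967296LL);
}

/* Add a line length to a 64-bit running total. Returns false and
 * leaves the total untouched if the addition would overflow or the
 * length is negative. */
bool add_length(int64_t *total, int64_t len)
{
    if (len < 0 || *total > INT64_MAX - len)
        return false;
    *total += len;
    return true;
}
```

With these, wrap_to_int32(2401803581LL) gives -1893163715 and wrap_to_int32(12741015670LL) gives -143886218, which are the two totals the C program prints for the large and huge files.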

So I am cheating, by having my program use a probably quite
well-written runtime against a more-or-less naïve C implementation.

When the time is dominated by disk-access, the timings are very close (C
first, then Haskell):

  $ for f in small_29M.fastq large_5G.fastq huge_33G.fastq; do time fastqstats $f; done
  Count 199957
  Total 199957 records 9997850 length 50 average

  real    0m0.129s
  user    0m0.098s
  sys     0m0.000s
  Count 10085674
  Total 10085674 records -1893163715 length -187.708 average

  real    0m19.975s
  user    0m8.335s
  sys     0m1.841s
  Count 63074335
  Total 63074335 records -143886218 length -2.28122 average

  real    2m7.448s
  user    0m56.549s
  sys     0m10.825s
  $ for f in small_29M.fastq large_5G.fastq huge_33G.fastq; do time hfastqstats $f; done
  Count 199957
  Total 199957 records 9997850 length 50.0 average

  real    0m0.120s
  user    0m0.048s
  sys     0m0.015s
  Count 10085674
  Total 10085674 records 2401803581 length 238.1401 average

  real    0m19.911s
  user    0m4.276s
  sys     0m2.120s
  Count 63074335
  Total 63074335 records 12741015670 length 202.0 average

  real    2m11.627s
  user    0m31.264s
  sys     0m13.468s
  $ 

So what happens when the disk-cache is hot?

I only have 16 GB RAM in my desktop, so I'll exclude the 33 GB file, and
run the two programs a number of times. After 10 runs of each, I get
these numbers (C first again, then Haskell):

  11
  fastqstats
  Count 199957
  Total 199957 records 9997850 length 50 average

  real    0m0.097s
  user    0m0.097s
  sys     0m0.000s
  Count 10085674
  Total 10085674 records -1893163715 length -187.708 average

  real    0m8.681s
  user    0m7.979s
  sys     0m0.696s
  hfastqstats
  Count 199957
  Total 199957 records 9997850 length 50.0 average

  real    0m0.066s
  user    0m0.062s
  sys     0m0.004s
  Count 10085674
  Total 10085674 records 2401803581 length 238.1401 average

  real    0m3.904s
  user    0m3.212s
  sys     0m0.688s
  $ 

which is kind of fun.

> In my eyes, the strength of Haskell is hidden in the readIllumina
> function: Bioinformatics is 50% parsing and converting text formats.

That's also why I like BioPerl a lot - someone else did the parsing for
me :-)

Thanks for the comments.


  Best regards,

    Adam

-- 
 "No more than that, but very powerful all the same;          Adam Sjøgren
  simple things are good."                               a...@koldfront.dk
