David Mathog wrote:
Scalable Informatics has released Scalable HMMer, an optimized version of HMMer 2.3.2 that is 1.6-2.5x faster per node on benchmark tests run on Opteron systems.

Did you remove the memory organization changes SE put in to make
it run better on the Altivec Macs?  Those really made life hard when I
was trying to optimize this code to run

Hi Dave:

We didn't start from the Altivec patch. It is in a large "ifdef" in fast_algorithms.c. I didn't see memory organization changes in the non-altivec code (though there was a line about some issue with the Intel compilers).

We started from the base P7Viterbi in fast_algorithms.c, and rewrote the loops a bit.

on our Beowulf with Athlon MP processors.  The problem was the
P7Viterbi data structures didn't fit entirely into cache (no matter

I was worried about cache thrashing (and still am) with our changes. The code isn't complex, but the particulars of the original implementation weren't terribly cache friendly.

how it was organized) and this resulted in toxic query lengths that ran
several times slower.  That is, take a query sequence
of length 1000, run hmmpfam, nip off the last character, run it again,
etc.  It was anything but a smooth function of execution time vs. query

Ohhh.... I would love a test like that. Is this something that you found in general with the baseline code or with the Altivec'ed code? This would be very good to include in our regression testing...

length.  Working around the Altivec stuff helped some but didn't
entirely eliminate the effect.  Probably the bigger cache on the
Opteron would eliminate this effect for smaller sequences but I'm
guessing you could still run into it with a long query.

We ran an 8000 letter query length as our longest test. If you have some specific test cases which exercise bugs, please let me know what they are and I will see if we can use them.
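
A minimal sketch of the truncation test described above, for use as a regression check. It assumes an `hmmpfam` binary on the PATH that is invoked as `hmmpfam <hmm database> <sequence file>`; the helper names, the per-residue normalization, and the 2x threshold are illustrative choices, not anything taken from the HMMer sources or from this thread.

```python
import subprocess
import tempfile
import time
from pathlib import Path


def truncated_queries(seq, min_len=1):
    """Yield the full sequence, then copies with one more residue nipped off the end."""
    for n in range(len(seq), min_len - 1, -1):
        yield seq[:n]


def time_hmmpfam(query, hmm_db, exe="hmmpfam"):
    """Write one query to a temporary FASTA file and return wall-clock seconds for one run.

    Assumption: `exe` accepts the database path followed by the sequence file path.
    """
    with tempfile.TemporaryDirectory() as tmp:
        fasta = Path(tmp) / "query.fa"
        fasta.write_text(f">query_len_{len(query)}\n{query}\n")
        start = time.perf_counter()
        subprocess.run([exe, str(hmm_db), str(fasta)], check=True,
                       stdout=subprocess.DEVNULL)
        return time.perf_counter() - start


def sweep(seq, timer, min_len=1):
    """Return (length, seconds) pairs for every truncation of seq."""
    return [(len(q), timer(q)) for q in truncated_queries(seq, min_len)]


def toxic_lengths(timings, factor=2.0):
    """Flag lengths whose time per residue exceeds `factor` times the median."""
    per_residue = sorted(t / n for n, t in timings)
    median = per_residue[len(per_residue) // 2]
    return [n for n, t in timings if t / n > factor * median]
```

With a real database this would be driven as `sweep(seq, lambda q: time_hmmpfam(q, "some.hmm"))`, where `some.hmm` is a placeholder path. Normalizing by length assumes run time grows roughly linearly with query length, so a smooth curve yields no flagged lengths.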


This has nothing to do with the Parallel implementation though, it
was a data size vs. cache size effect.

That is an issue with this code. The Athlon has a 256k L2 last I remember, and a 128k L1. Rather hard to keep lots of stuff in cache.
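
As a rough illustration of the data size vs. cache size point, the arithmetic below compares an assumed dynamic programming matrix against the 256k L2 mentioned above. The layout is an assumption made for the estimate (three score arrays of 32-bit ints per row, plus five special-state scores per row); it is not read out of fast_algorithms.c, and the function names are hypothetical.

```python
INT_BYTES = 4            # assumption: scores are stored as 32-bit ints
SPECIAL_STATES = 5       # assumption: five special-state scores per row
L2_BYTES = 256 * 1024    # the Athlon L2 size quoted above


def full_matrix_bytes(query_len, model_len):
    """Bytes for a full matrix: one row per query position, three score arrays per row."""
    rows = query_len + 1
    cols = model_len + 1
    return rows * (3 * cols + SPECIAL_STATES) * INT_BYTES


def two_row_bytes(model_len):
    """Bytes for just the current and previous rows, which one recurrence step reads and writes."""
    return 2 * (3 * (model_len + 1) + SPECIAL_STATES) * INT_BYTES


if __name__ == "__main__":
    # Example sizes only: a 1000-residue query against a 200-state model.
    print(full_matrix_bytes(1000, 200), two_row_bytes(200), L2_BYTES)
```

Under these assumptions the full matrix for a 1000-residue query is several times larger than L2, while the two rows touched by a single step are small. Which of the two dominates the observed slowdowns is not settled in this thread.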

Right now the big issue we are running into for another aspect of this project is the lack of a vector max/min function in SSE*. (If anyone from AMD/Intel is listening, this is a *big* issue, and I even have a rough idea how to do it "quickly" in SSE at the expense of many SSE registers.)
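
For readers unfamiliar with the problem, here is a scalar model of one common workaround, compare-and-select. It assumes the missing operation is a per-lane maximum of signed 32-bit integers, and it is not necessarily the approach hinted at above. In SSE2 the same three steps would map onto `_mm_cmpgt_epi32`, then `_mm_and_si128` / `_mm_andnot_si128`, then `_mm_or_si128`, which costs extra registers for the mask and the temporaries.

```python
def select_max(a, b):
    """Branch-free max of two signed ints, modeling a single SIMD lane."""
    mask = -(a > b)                   # all ones (-1) when a > b, otherwise 0
    return (a & mask) | (b & ~mask)   # keep a where the mask is set, b elsewhere


def packed_max(xs, ys):
    """Apply the lane-wise select to two equal-length vectors of ints."""
    return [select_max(a, b) for a, b in zip(xs, ys)]


if __name__ == "__main__":
    print(packed_max([1, -7, 5, 0], [2, -9, 5, -1]))
```

A per-lane minimum follows by swapping the operands of the comparison.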

Joe


Regards,

David Mathog
[EMAIL PROTECTED]
Manager, Sequence Analysis Facility, Biology Division, Caltech

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
