Bill Pringlemeir wrote:
> >> On 17 Sep 2008, [EMAIL PROTECTED] wrote:
> 
> >>> Isn't this a bit absurd? Maybe I'm missing something but all I can see
> >>> is that your code adds these two optimizations:
> 
> > Bill Pringlemeir wrote:
> 
> >>>   32 - variable -> const
> >>>   hashcode >> variable -> hashcode >> const
> 
> >> The hashcode is also used in the inline function RT_READ_SLOT.
> 
> On 17 Sep 2008, [EMAIL PROTECTED] wrote:
> 
> > Yes, but the shifted hashcode isn't a constant, so RT_READ_SLOT()
> > can hardly benefit from the constant shift value.
> 
> It looks like this,
> 
>       const guint   shift = 32 - bits;                                  \
>       guint32 idx = qhv->vec[i].hashcode >> shift;                  \
>  RT_SLOT_READ -> return 0 != (arena[idx >> 3] & (0x80U >> (i & 0x7)));

I looked at the generated code and the variants really only differed
with respect to the constants used for the shift opcodes. Otherwise,
the code and size was identical. There are no further shortcuts
applied. I think the only case where some additional optimization
was possible is "shift = 16" but you have to do that manually, GCC
at least doesn't make use of it. Maybe not because it doesn't know
but because it's not worth. Assembly wisdom, especially with respect
to x86, becomes outdated very quickly.

> So if bits is 16, it is
> 
>     shift = 32 - 16 = 16
>     idx = hash >> 16
> 
> ***  RT_SLOT_READ arena[ hash >> 16 >> 3 ] & 0x80 >> (hash >> 16 & 7) 
> 
> There are many more constant shift than there use to be.  For all x86
> machines this is actually a big win as the variable shift has to be in
> the 'c' register (afaik).  So the calculated shift has to be loaded
> into cx.  I agree that x86 is stupid, but it also widely used.  I also
> wanted to make sure that I didn't harm any other processors like the
> PPC, ARM, MIPS, etc.  I think that majority of processors are x86 or
> PPC.

I've supplied three different variants now. The one using a lookup
table to avoid shifting 0x80 seems to yield the smallest code. It's
not necessarily the fastest. Using div() looks worthwhile to me
as on x86 div yields both quotient and remainder and it's although
fast on modern x86, so it could be faster than doing two shifts
even if it increases code size slightly.

qrp_can_route_?? should be smaller by about 30% now. Compiling with
the default flags (that is -O2 and no -march) yields 104 byte for
these functions here and it was about 150 before (after removing
the assertion checks).

The most significant improvement comes actually from using qhv->count
differently (saving 31 byte). Using it as loop condition was very
sub-optimal.  Of course that's also due to the small size of the
function. In a larger function where registers are spilled all the time,
it wouldn't matter.

> I do compile with flags specific to my CPU.  However, I never looked
> at the assembler until your post.  gprof was indicating better
> performance, but when I look at the number closely they don't seem to
> make total sense.  My processor isn't switching frequencies as much,
> but this could be just due to network variance.
> 
> Did you see no difference with whatever machine(s) you have?

So far I didn't benchmark any of it. While a micro-benchmark doesn't
necessarily yield realistic results, the macro benchmark depends on
far too many factors. I expect just being connected to a few leaves with
large QRTs could make a significant difference that would completely
overshadow your optimizations.

-- 
Christian

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
gtk-gnutella-devel mailing list
gtk-gnutella-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/gtk-gnutella-devel

Reply via email to