Hi,

A little program, objprelink.c, appeared on the kde-devel mailing
list.  It dramatically reduces the startup time of C++ programs
(it modifies the .o files before linking).
But it can't be compiled on Cooker (or Rawhide).
I think binutils-2.11.90.0.8-3mdk.rpm is broken.

When trying to compile it
(gcc -O2 -o objprelink objprelink.c -lbfd -liberty)
it fails with the following errors:

/usr/lib/gcc-lib/i586-mandrake-linux-gnu/2.96/../../../libbfd.so: undefined reference to `htab_find_slot_with_hash'
/usr/lib/gcc-lib/i586-mandrake-linux-gnu/2.96/../../../libbfd.so: undefined reference to `htab_create'
/usr/lib/gcc-lib/i586-mandrake-linux-gnu/2.96/../../../libbfd.so: undefined reference to `htab_delete'

It seems libiberty is not current with libbfd. The above three
functions are in libiberty.a but not in libbfd.
The rpm from Rawhide suffers from the very same problem.
I compiled binutils-2.11.2.tar.gz from ftp.gnu.org.  The resulting
libbfd no longer contains these symbols, and objprelink.c
compiles and works without problems.

Here is what Leon Bottou, the author of the program, wrote:
> > I just inspected the cooker rpms for binutils.
> > Here is the problem I think:
> > You need a version of libiberty that is current with libbfd.
> > The problem is that libbfd.so comes with package
> >     <libbinutils2-2.11.90.0.8-3mdk.i386.rpm>
> > while libiberty.a comes with package
> >     <libbinutils2-devel-2.11.90.0.8-3mdk.i386.rpm>
> >
> > I checked that this latter package contains a version of libiberty.a
> > that contains the missing function.
> > Some older versions of binutils do not.

See the corresponding threads in the kde-devel ml:
http://lists.kde.org/?t=99618665700002&w=2&r=1
http://lists.kde.org/?t=99627349300001&w=2&r=1

I attached objprelink.c and the README to this mail.

Cheers,
Andreas Simon



Waldo Bastian's document <http://www.suse.de/~bastian/Export/linking.txt>
demonstrates that the current g++ implementation generates lots of expensive
run-time relocations.  This translates into the slow startup of large C++
applications (KDE, StarOffice, etc.).  

The attached program "objprelink.c" is designed to reduce the problem. 
Expect startup times 30-50% faster.


1) HOWTO
=========

You must first compile objprelink.c as follows:

    $ gcc -O2 -o objprelink objprelink.c -lbfd -liberty

This program must be run on every object file (.o file) that
composes the application or shared library.   

For the KDE packages, for instance, the simplest way consists of
first making a regular build.  The following commands then fix
all object files, and relink all executables and libraries.

    $ find . -name '*.o' -exec objprelink {} \;
    $ find . -name '*.lo' -exec touch {} \;
    $ make

Another approach consists in tweaking the Makefiles.
That works well for QT.




2) PRINCIPLE
=============

The name "objprelink" means that the program must be run before linking shared
libraries or executables.  I will explain the idea using Waldo's little
programs "testclassN.cpp".

-----------------------------------------------------------------
testclassN.cpp
-----------------------------------------------------------------
#include <qwidget.h>
template<int T> class testclass : public QWidget {
public:
  virtual void setSizeIncrement(int w, int h) 
    { QWidget::setSizeIncrement(w+T, h+T); }
};
template class testclass<1>;
template class testclass<2>;
....                           // as many as we want.
template class testclass<N>;
-----------------------------------------------------------------


Let's first compile this program using the regular method.

    $ g++ -c -I$QTDIR/include testclass1.cpp
    $ g++ -shared -o testclass1.so testclass1.o -L$QTDIR/lib -lqt

The resulting object file "testclass1.o" contains several sections.
One section contains the virtual table for the class testclass<1>.
Here are the relocations for this section:

----------------------------------------------------------------
BEFORE (vtable relocs for testclass<1>)
----------------------------------------------------------------
RELOCATION RECORDS FOR [.gnu.linkonce.d.__vt_t9testclass1i1]:
OFFSET   TYPE              VALUE
00000004 R_386_32 __tft9testclass1i1
00000008 R_386_32 _._t9testclass1i1
0000000c R_386_32 event__7QWidgetP6QEvent
00000010 R_386_32 eventFilter__7QObjectP7QObjectP6QEvent
00000014 R_386_32 metaObject__C7QWidget
00000018 R_386_32 className__C7QWidget
0000001c R_386_32 setName__7QWidgetPCc
....
----------------------------------------------------------------

Each of these relocations requires an expensive symbol lookup at run time.
There will be a relocation to function QWidget::className(..) in the vtable of
every class that inherits QWidget.  The same will happen for the 70+ virtual
functions defined by QWidget.
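
A listing like the one above can be obtained with "objdump -r".  The sketch
below filters that output down to the vtable sections and prints the symbol
each slot points to.  The function name "vtable_relocs" is made up for this
example, and it assumes the g++ 2.9x naming in which vtable section names
contain "__vt_".

```shell
# Print the targets of relocation records that belong to vtable sections.
# Reads `objdump -r` output on stdin.
# Assumption: vtable section names contain "__vt_" (g++ 2.9x mangling,
# as in the listing above).
vtable_relocs() {
    awk '
        /^RELOCATION RECORDS FOR/ { keep = ($0 ~ /__vt_/) }
        keep && /R_386_/          { print $3 }
    '
}

# Usage (file name taken from the example above):
#   objdump -r testclass1.o | vtable_relocs
```

Every name printed this way costs one symbol lookup at load time in the
regular build.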

The "objprelink" program adds one indirection into the vtables.  It inserts a
stub section for each function appearing in vtables and moves the expensive
relocation there:

----------------------------------------------------------------
AFTER  (stub for QWidget::className)
----------------------------------------------------------------
DISASSEMBLY OF [.gnu.linkonce.t.stub.className__C7QWidget]:
00000000 <.gnu.linkonce.t.stub.className__C7QWidget>:
   0:   b8 00 00 00 00          mov    $0x0,%eax
                        1: R_386_32     className__C7QWidget
   5:   ff e0                   jmp    *%eax
----------------------------------------------------------------

The whole trick is that there is only one such section per function.  This
section is shared by all the QWidget subclasses defined in this library.  
The vtable relocs are then modified to point to the stub sections.
These relocs will become R_386_RELATIVE in the shared object and
will not require a symbol lookup.

----------------------------------------------------------------
AFTER (vtable relocs for testclass<1>)
----------------------------------------------------------------
RELOCATION RECORDS FOR [.gnu.linkonce.d.__vt_t9testclass1i1]:
OFFSET   TYPE              VALUE
00000004 R_386_32 .gnu.linkonce.t.stub.__tft9testclass1i1
00000008 R_386_32 .gnu.linkonce.t.stub._._t9testclass1i1
0000000c R_386_32 .gnu.linkonce.t.stub.event__7QWidgetP6QEvent
00000010 R_386_32 .gnu.linkonce.t.stub.eventFilter__7QObjectP7QObjectP6QEvent
00000014 R_386_32 .gnu.linkonce.t.stub.metaObject__C7QWidget
00000018 R_386_32 .gnu.linkonce.t.stub.className__C7QWidget
0000001c R_386_32 .gnu.linkonce.t.stub.setName__7QWidgetPCc
....
----------------------------------------------------------------
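
To check that an object file has really been fixed, one can count how many
vtable relocations now point at stub sections.  This is only a sketch: the
function name "vtable_stub_summary" is invented here, and it relies on the
section naming shown in the listings above.

```shell
# Summarise vtable relocations from `objdump -r` output on stdin:
# how many point at stub sections and how many still name a symbol directly.
# Assumption: vtable sections contain "__vt_" and stub sections start with
# ".gnu.linkonce.t.stub.", as in the listings above.
vtable_stub_summary() {
    awk '
        /^RELOCATION RECORDS FOR/ { keep = ($0 ~ /__vt_/); next }
        keep && /R_386_32/ {
            if ($3 ~ /^\.gnu\.linkonce\.t\.stub\./) stubbed++
            else direct++
        }
        END { printf "stubbed=%d direct=%d\n", stubbed, direct }
    '
}

# Usage (file name taken from the example above):
#   objdump -r testclass1.o | vtable_stub_summary
```

On a fixed object file one would expect most vtable relocations to be counted
as "stubbed".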

One important point is that "objprelink" does not change the symbol table.
Undefined symbols remain undefined.  Defined symbols remain defined.  It just
changes the relocation records without modifying the linking semantics.
This is not like option -Bsymbolic, which does change how symbols are bound.



3) RESULTS
===========


The following table compares the numbers of relocations in shared libraries
generated from regular object files (before the slash) and from fixed object
files (after the slash).  Figures are provided for some testclassN programs
and also for the QT library.

------------------------------------------------------------------------------
                     R_386_32  R_386_GLOB_DAT  R_386_JUMP_SLOT  R_386_RELATIVE
------------------------------------------------------------------------------
testclass1.so         106/105       9/9             8/8            3/108
testclass2.so         212/110      13/13            8/8            3/213
testclass5.so         530/125      25/25            8/8            3/528
testclass10.so       1060/150      45/45            8/8            3/1053
testclass20.so       2120/200      85/85            8/8            3/2103
testclass50.so       5300/350     205/205           8/8            3/5253
------------------------------------------------------------------------------
libqt.so            16915/4563   2690/2690        5039/5039      4933/21669
------------------------------------------------------------------------------

Basically it transforms a large number of expensive R_386_32 relocations into
comparatively cheap R_386_RELATIVE relocations.  This is a gain because it
reduces the number of symbol lookups during the dynamic loading.
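
Counts of this kind can be gathered from "readelf -r".  The following is a
minimal sketch, assuming the usual readelf layout where the relocation type
is the third column; the function name "reloc_histogram" is made up for this
example.

```shell
# Count dynamic relocations by type.  Reads `readelf -r` output on stdin
# and prints "count type" pairs, most frequent first.
# Assumption: the relocation type is the third whitespace-separated field.
reloc_histogram() {
    awk '$3 ~ /^R_[A-Z0-9_]+$/ { n[$3]++ }
         END { for (t in n) print n[t], t }' | sort -rn
}

# Usage (library name taken from the table above):
#   readelf -r testclass1.so | reloc_histogram
```

Running it on the regular and on the fixed build of the same library gives
the two numbers on either side of the slash.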

The following table gives the execution time of an empty main function
dynamically linked with the above shared libraries.  Units are milliseconds
averaged over one hundred runs.

----------------------------------------------------------------
libqt.so              regular     regular    prelink    prelink   
testclass*.so         regular     prelink    regular    prelink   
----------------------------------------------------------------
testclass1.so           60          61         41         40    
testclass2.so           63          62         40         40    
testclass5.so           62          63         41         40    
testclass10.so          64          63         43         40
testclass20.so          67          64         45         42
testclass50.so          74          68         54         45
----------------------------------------------------------------

This shows a 30% improvement when everything gets prelinked.  
I made a few additional measurements using LD_DEBUG=statistics.
These indicate even larger improvements.
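
For anyone who wants to repeat such measurements, here is a rough sketch.
It is not the script used for the table above.  It assumes GNU date (for the
%N nanosecond format) and glibc's "number of relocations: N" wording in the
LD_DEBUG=statistics output, which may differ between versions.  The function
names and "./a.out" are made up for this example.

```shell
# Average wall-clock time of a command in milliseconds over N runs.
# Assumption: GNU date, whose %N format prints nanoseconds.
# The figure includes the cost of spawning the process.
avg_ms() {
    runs=$1; shift
    i=0
    start=$(date +%s%N)
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1
        i=$((i + 1))
    done
    end=$(date +%s%N)
    echo $(( (end - start) / runs / 1000000 ))
}

# Extract the relocation count from LD_DEBUG=statistics output on stdin.
# Assumption: glibc prints a line of the form "number of relocations: N".
ld_reloc_count() {
    awk -F': *' '/number of relocations:/ { print $NF; exit }'
}

# Usage (./a.out stands for the empty main linked with the libraries):
#   avg_ms 100 ./a.out
#   LD_DEBUG=statistics ./a.out 2>&1 | ld_reloc_count
```

The dynamic loader writes its statistics to stderr, hence the "2>&1".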

I am progressively recompiling the C++ libraries on my system.  Last night
I recompiled "libqt.so.2.3.1" and installed it.  Then I recompiled
"libqtcups.so" and observed a dramatic speedup in the startup time of "qtcups".
These tests provide extensive coverage of the virtual table modifications.


My initial plan was to use an R_386_PLT32 relocation in the stub
sections.  This would buy me lazy symbol binding and even faster startup
times.  This is trickier than it looks because one should not jump into the
PLT without the proper GOT pointer in %ebx.  I have not been able to achieve
this with an acceptable overhead.  Any ideas?


objprelink.c.bz2
