Re: Opportunity for speedup

2009-03-11 Thread Daniel Drake
2009/3/1 Bobby Powers bobbypow...@gmail.com:
 I can't seem to get ul-warning to come up properly, so if anyone can
 tell me what I'm doing wrong that would be great.  I've got it to work
 by manually placing some symlinks in /etc/rc0.d and /etc/rc6.d, but
 neither Scott's nor my chkconfig comments seem to work.

Here's a fixed ul-warning initscript.


[Attachment: ul-warning (binary data)]
___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: Opportunity for speedup

2009-03-11 Thread Bobby Powers
On Wed, Mar 11, 2009 at 4:13 PM, Daniel Drake d...@laptop.org wrote:

 2009/3/1 Bobby Powers bobbypow...@gmail.com:
  I can't seem to get ul-warning to come up properly, so if anyone can
  tell me what I'm doing wrong that would be great.  I've got it to work
  by manually placing some symlinks in /etc/rc0.d and /etc/rc6.d, but
  neither Scott's nor my chkconfig comments seem to work.

 Here's a fixed ul-warning initscript.


thanks, the fix is pushed.


Re: Opportunity for speedup

2009-03-03 Thread Gary C Martin
Hi Bobby,

On 1 Mar 2009, at 21:44, Bobby Powers wrote:

 I've fixed a few issues, packaged up bootanim-2.3-1, and (finally)
 actually ran some benchmarks.  Results (all times in seconds):

 fresh os801, from pressing the power button to appearance of sugar's
 prompt for name screen
 80
 79
 78

 with rhgb-client renamed so that init can't find it:
 69
 68

 and with bootanim-2.(1-3) rpm installed:
 67
 67
 67
 68
 67

 If anyone is unconvinced, I could run more tests, but this seems
 pretty good to me.  It's a 15% overall speedup in the boot process.

I've just run a test here with candidate 801, averaged over 5 runs,
starting on button press and stopping when the XO first appears in the
user's colours:

Before bootanim-2.3-1.i386.rpm:

85.9 seconds

After patching:

74.6 seconds

Booting in ugly text mode (includes the 3 sec ok wait):

72.2 seconds

So, if this 10 sec boot saving gets accepted in a future build, you've  
just gained the world 1,400 extra hours of XO usage from the time this  
patch lands, and for every day thereafter (assumes a conservative 500K  
kids boot their XO just once a day on average).
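
As a rough check of that estimate, the sketch below multiplies the seconds saved per boot by the number of boots per day and converts to hours. The function name is made up for illustration, and the inputs are just the figures quoted above.

```c
#include <assert.h>

/* Hours of XO use gained per day, given the seconds saved on each boot
 * and the number of boots per day.  Hypothetical helper, for illustration. */
static double hours_gained_per_day(double secs_saved, long boots_per_day)
{
    assert(secs_saved >= 0.0 && boots_per_day >= 0);
    return secs_saved * (double)boots_per_day / 3600.0;
}
```

Calling hours_gained_per_day(10.0, 500000) gives about 1,389 hours, which rounds to the 1,400 quoted. Plugging in the measured difference of 85.9 - 74.6 = 11.3 seconds instead gives roughly 1,569 hours.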

Fantastic work, what an impressive butterfly effect!! :-)

--Gary


Re: Opportunity for speedup

2009-03-01 Thread Bobby Powers
I've fixed a few issues, packaged up bootanim-2.3-1, and (finally)
actually ran some benchmarks.  Results (all times in seconds):

fresh os801, from pressing the power button to appearance of sugar's
prompt for name screen
80
79
78

with rhgb-client renamed so that init can't find it:
69
68

and with bootanim-2.(1-3) rpm installed:
67
67
67
68
67

If anyone is unconvinced, I could run more tests, but this seems
pretty good to me.  It's a 15% overall speedup in the boot process.
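
As a rough check of the 15% figure, the sketch below averages the stock and bootanim timings listed above and computes the relative saving. The helper names (mean, percent_saved) are made up for illustration and are not part of bootanim.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n stopwatch readings, in seconds. */
static double mean(const int *secs, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += secs[i];
    return sum / (double)n;
}

/* Relative saving of `after` compared with `before`, as a percentage. */
static double percent_saved(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Readings copied from the results above. */
static const int stock[]    = { 80, 79, 78 };
static const int bootanim[] = { 67, 67, 67, 68, 67 };
```

With these inputs the means are 79 s and 67.2 s, a saving of 11.8 s, or just under 15% of the stock boot time.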

Interesting notes:
chkconfig doesn't like binary services - it parses services in
/etc/init.d to look for metadata in comments, and the mechanism to
override this data (sticking a file with the same name in
/etc/chkconfig.d with appropriate comments) doesn't seem to work if
the original script can't be parsed.  So I had to make small wrappers
for ul-warning, boot-anim-start and boot-anim-stop.  This doesn't seem
to affect performance.

I can't seem to get ul-warning to come up properly, so if anyone can
tell me what I'm doing wrong that would be great.  I've got it to work
by manually placing some symlinks in /etc/rc0.d and /etc/rc6.d, but
neither Scott's nor my chkconfig comments seem to work.

source:
http://dev.laptop.org/git?p=users/bobbyp/bootanim
koji-built rpms:
http://dev.laptop.org/~bobbyp/bootanim/
(koji task https://koji.fedoraproject.org/koji/taskinfo?taskID=1211738 )

I don't know if this could make it into 8.2.1, or what the process
would be toward getting it at least in the Rawhide/SOAS images, but it
seems pretty low risk (assuming someone can tell me what I'm doing
wrong w.r.t. ul-warning).

yours,
Bobby

Re: Opportunity for speedup

2009-03-01 Thread Gary C Martin
On 1 Mar 2009, at 21:44, Bobby Powers wrote:

 I've fixed a few issues, packaged up bootanim-2.3-1, and (finally)
 actually ran some benchmarks.  Results (all times in seconds):

 fresh os801, from pressing the power button to appearance of sugar's
 prompt for name screen
 80
 79
 78

 with rhgb-client renamed so that init can't find it:
 69
 68

 and with bootanim-2.(1-3) rpm installed:
 67
 67
 67
 68
 67

 If anyone is unconvinced, I could run more tests, but this seems
 pretty good to me.  It's a 15% overall speedup in the boot process.

Hey Bobby, that sounds great, many thanks for putting the effort in!  
I'll try your rpm on one of the XOs here and ping back with some  
additional measurements.

Regards,
--Gary


Re: Opportunity for speedup

2009-03-01 Thread James Cameron
On Sun, Mar 01, 2009 at 04:44:01PM -0500, Bobby Powers wrote:
 I can't seem to get ul-warning to come up properly, so if anyone can
 tell me what I'm doing wrong that would be great.

What actually goes wrong?  Is ul-warning executed?

-- 
James Cameron    mailto:qu...@us.netrek.org    http://quozl.netrek.org/


Re: Opportunity for speedup

2009-02-19 Thread Mitch Bradley
Cool!

Bobby Powers wrote:
 On Wed, Feb 11, 2009 at 2:01 AM, Mitch Bradley w...@laptop.org wrote:
   
 I just measured the time taken by the boot animation by the simple
 technique of renaming /usr/bin/rhgb-client so the initscripts can't find it.
 

 how did you measure exactly? stopwatch? I'd like to recreate the
 tests.  It sounds like you did this on a freshly flashed system?
   

Yes on both counts.  Stopwatch on freshly-flashed os7.img .


   
 With boot animation, OS build 7 (an older 8.2.1 candidate) takes 60
 seconds from first dot (indicating OFW transfer to Linux) to Sugar
 prompt for your name.   Without it, 53 seconds.  I repeated the test
 several times with consistent results.

 Clearly, it should be possible to display that amount of information in
 much less than 7 seconds.

 The boot animation code is in the OLPC domain, not the upstream domain,
 so replacing it should be relatively free of upstream politics.

 So if anybody is interested in implementing a relatively simple
 boot-time speedup, I offer this as low-hanging fruit.

 I suggest 1 second (differential time between animation and no-animation
 cases) as a reasonable target goal, assuming images of the complexity of
 the current ones.  Arbitrary full-screen graphics might require more
 time, but speeding up the baseline case is a good starting point.

 Go wild.
 

 So I've taken a first cut at this, implemented with the following
 design considerations (mostly from a conversation with Mitch)
 - the Python client/server was reimplemented as several standalone C
 programs (boot-anim-start, boot-anim-client, and some cleanup in
 boot-anim-stop)
 - a client and server were used before because there is state
 information that needs to be saved: we need to keep track of where in
 the animation we are.  We can keep track of this by using offscreen
 memory in the framebuffer (it's 16MB in size, and only the first 2ish
 MB is used for the onscreen graphics (my terminology might be off
 here)).  For state we really only need to keep track of 2 integers,
 one for the current frame number and another to store the offset of
 the next diff to apply.
 - on startup we load an initial image into the framebuffer (the first
 1200*900*2 bytes, since we use 2 bytes per pixel for color
 information), and then load in a series of changes to the framebuffer
 image (300KB).  This takes the form of a series of diffs.
 - for each update (a valid call to boot-anim-client) we apply the next
 diff in the series to the onscreen image and update our state
 information
 - after applying the last diff we have (the end of the animation
 series), freeze the DCON (when I first attempted to freeze the DCON
 when z-boot-anim-stop was called it left the screen in an inconsistent
 state, I believe because of X startup)
 - it's designed to be as light as possible, using syscalls instead of
 libc functions as much as possible (the only thing we use libc for is
 string comparison, which could be replaced with a local function).
 While it's written like this, I haven't worked on cutting down the
 linking (I need some guidance for that)
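
The state-in-framebuffer idea from the list above could look roughly like the sketch below. It is only an illustration under stated assumptions: the framebuffer is modelled as a plain byte buffer rather than an mmap of the real device, the two state integers are placed immediately after the visible 1200*900*2 bytes, and struct diff_run is a hypothetical record, not the actual D565 layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    FB_WIDTH      = 1200,
    FB_HEIGHT     = 900,
    FB_BPP        = 2,                               /* bytes per pixel */
    VISIBLE_BYTES = FB_WIDTH * FB_HEIGHT * FB_BPP    /* 2,160,000 bytes */
};

/* The two integers of state.  Keeping them right after the visible
 * image is an assumption made for this sketch. */
struct anim_state {
    uint32_t frame;       /* current frame number */
    uint32_t next_diff;   /* byte offset of the next diff to apply */
};

/* Hypothetical diff record: one run of pixels to overwrite.
 * This is NOT the real D565 format. */
struct diff_run {
    uint32_t pixel_offset;    /* first pixel to overwrite */
    uint32_t pixel_count;     /* number of pixels in the run */
    const uint16_t *pixels;   /* replacement pixel values */
};

static struct anim_state load_state(const uint8_t *fb)
{
    struct anim_state s;
    memcpy(&s, fb + VISIBLE_BYTES, sizeof s);
    return s;
}

static void store_state(uint8_t *fb, struct anim_state s)
{
    memcpy(fb + VISIBLE_BYTES, &s, sizeof s);
}

/* Apply one frame's runs to the visible image, then advance the state
 * stored in offscreen memory by one frame and diff_bytes bytes. */
static void apply_frame(uint8_t *fb, const struct diff_run *runs,
                        size_t nruns, uint32_t diff_bytes)
{
    for (size_t i = 0; i < nruns; i++) {
        assert((uint64_t)runs[i].pixel_offset + runs[i].pixel_count
               <= (uint64_t)FB_WIDTH * FB_HEIGHT);
        memcpy(fb + (size_t)runs[i].pixel_offset * FB_BPP,
               runs[i].pixels,
               (size_t)runs[i].pixel_count * FB_BPP);
    }
    struct anim_state s = load_state(fb);
    s.frame += 1;
    s.next_diff += diff_bytes;
    store_state(fb, s);
}
```

Because the state lives in the buffer itself, each call to apply_frame can come from a separate short-lived process and still pick up where the previous one stopped, which is what removes the need for a server.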
   

To reduce the execution footprint, you could try linking it against 
dietlibc, http://www.fefe.de/dietlibc/

I'm not sure just how much time that would save; maybe it wouldn't be 
significant.  But it's worth a try.


 comments and suggestions welcome :)

 I'd appreciate any testing as well as any code review.  (the shutdown
 image appears to be broken, FYI.  I haven't looked at that in depth;
 it's probably a one-line fix.)
 rpms (built with mock) are available at
 http://dev.laptop.org/~bobbyp/bootanim/
 and source is avail at
 http://dev.laptop.org/git?p=users/bobbyp/bootanim

 -Bobby
   



Re: Opportunity for speedup

2009-02-19 Thread pgf
mitch wrote:
  Bobby Powers wrote:
   - it's designed to be as light as possible, using syscalls instead of
   libc functions as much as possible (the only thing we use libc for is
   string comparison, which could be replaced with a local function).
   while it's written like this, I haven't worked on cutting down the
   linking (I need some guidance for that)

great stuff bobby -- i'm happy to help with any remaining details if
you like.

 
  
  To reduce the execution footprint, you could try linking it against 
  dietlibc, http://www.fefe.de/dietlibc/
  
  I'm not sure just how much time that would save; maybe it wouldn't be 
  significant.  But it's worth a try.

my gut says that using already present glibc shared lib will be cheaper
than introducing a new library, even if it's small and static.  but
you're right it's worth a try.

   and source is avail at
   http://dev.laptop.org/git?p=users/bobbyp/bootanim

i took a very brief look.  as a favor to future maintainers,
i think you could either a) merge boot-anim-start/client/stop and
ul-warning into a single executable (much of the code is the
same) or b) extract the common parts (e.g. initial_setup(), and the
code that mmaps the framebuffer) into a boot-anim-utils.c or
something like that.

(and while i'm all for reducing dependencies, the XO has so much
else going on that i don't think linking against string libraries
or even stdio will affect things much in the greater scheme of
things.  so i'd have used fputs rather than write(2,...) for
errors.  but i understand the intent.)

paul
=-
 paul fox, p...@laptop.org


Re: Opportunity for speedup

2009-02-19 Thread Peter Robinson
 I just measured the time taken by the boot animation by the simple
 technique of renaming /usr/bin/rhgb-client so the initscripts can't find it.

 how did you measure exactly? stopwatch? I'd like to recreate the
 tests.  It sounds like you did this on a freshly flashed system?

A number of tools were used by some of the Fedora devs to measure boot
speed when developing plymouth to replace the old RHGB system. It
would be interesting to try plymouth here (both text and graphical) to
see how it compares. It might be possible to get a lot of
the wins that Fedora got with very little work, as plymouth has a full
plugin system, so it shouldn't be hard to add the OLPC boot logos.

Peter


Re: Opportunity for speedup

2009-02-19 Thread C. Scott Ananian
I'd suggest just uncompressing the various image files and re-timing
as a start.  The initial implementation was uncompressed, but people
complained about space usage on the emulator images (which are
uncompressed).  The current code supports both uncompressed and
compressed image formats.  For uncompressed images, putting the bits
on the screen is an mmap and memcpy, so I can't imagine any
implementation being faster than that (it's possible, of course, that
what's stealing CPU is the shell's invocation of the client program;
recoding just that little part in C should be trivial, since it does
nothing but write to a socket IIRC.)
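
The "mmap and memcpy" path described above can be sketched roughly as follows. This is not the existing client/server code: blit_file is a hypothetical helper, and dst stands in for the mapped framebuffer, so any writable buffer will do for experimenting.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the raw image stored at `path` into `dst`.
 * Returns the number of bytes copied, or -1 on error.
 * Hypothetical helper, for illustration only. */
static ssize_t blit_file(const char *path, void *dst, size_t dst_len)
{
    assert(path != NULL && dst != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    if (len > dst_len)
        len = dst_len;               /* never write past the destination */

    void *src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (src == MAP_FAILED)
        return -1;

    memcpy(dst, src, len);           /* the actual "put bits on screen" step */
    munmap(src, len);
    return (ssize_t)len;
}
```

The copy itself is a single memcpy, so most of the cost in this path comes from paging the file in from storage and from starting the process, not from the copy loop.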

Anyway, further benchmarking of the current implementation is probably
worthwhile before a complete reimplementation is called for.  But if
you want to reimplement it from scratch, go nuts.
 --scott

-- 
 ( http://cscott.net/ )


Re: Opportunity for speedup

2009-02-19 Thread Mitch Bradley
C. Scott Ananian wrote:
 I'd suggest just uncompressing the various image files and re-timing
 as a start.  The initial implementation was uncompressed, but people
 complained about space usage on the emulator images (which are
 uncompressed).  The current code supports both uncompressed and
 compressed image formats.  For uncompressed images, putting the bits
 on the screen is an mmap and memcpy, so I can't imagine any
 implementation being faster than that (it's possible, of course, that
 what's stealing CPU is the shell's invocation of the client program;
 recoding just that little part in C should be trivial, since it does
 nothing but write to a socket IIRC.)

 Anyway, further benchmarking of the current implementation is probably
 worthwhile before a complete reimplementation is called for.  But if
 you want to reimplement it from scratch, go nuts.
  --scott

   
It has already been reimplemented.

The disk I/O time for 26 full-screen images is several seconds.



Re: Opportunity for speedup

2009-02-19 Thread Wade Brainerd
On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian csc...@laptop.org wrote:

 I'd suggest just uncompressing the various image files and re-timing
 as a start.  The initial implementation was uncompressed, but people
 complained about space usage on the emulator images (which are
 uncompressed).  The current code supports both uncompressed and
 compressed image formats.  For uncompressed images, putting the bits
 on the screen is an mmap and memcpy, so I can't imagine any
 implementation being faster than that (it's possible, of course, that
 what's stealing CPU is the shell's invocation of the client program;
 recoding just that little part in C should be trivial, since it does
 nothing but write to a socket IIRC.)


I implemented an RLE compressor specifically for these 16-bit image files the
last time this question came up.  This can certainly be faster than memcpy,
since memory performance is the limiting factor here.

GZip+RLE also beats plain GZip on size, again due to the contents of the
images.

http://wadeb.com/rle.c
http://wadeb.com/unrle.c

-Wade


Re: Opportunity for speedup

2009-02-19 Thread Bobby Powers
On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian csc...@laptop.org wrote:
 I'd suggest just uncompressing the various image files and re-timing
 as a start.  The initial implementation was uncompressed, but people
 complained about space usage on the emulator images (which are
 uncompressed).  The current code supports both uncompressed and
 compressed image formats.  For uncompressed images, putting the bits
 on the screen is an mmap and memcpy, so I can't imagine any
 implementation being faster than that (it's possible, of course, that
 what's stealing CPU is the shell's invocation of the client program;
 recoding just that little part in C should be trivial, since it does
 nothing but write to a socket IIRC.)

 Anyway, further benchmarking of the current implementation is probably
 worthwhile before a complete reimplementation is called for.  But if
 you want to reimplement it from scratch, go nuts.
  --scott

I already re-implemented it - it was a fun optimization project and
introduction to lower level systems programming.  Using Mitch's D565
format to keep track of only the parts of the image that change cut
down the implementation size significantly.  It's now only 2
uncompressed images (frame00.565 and ul-warning.565), and 300KB of
differences for the animation sequence.  I understand reads from video
memory (which I think is what the framebuffer is?) can be extremely
slow, so it could turn out faster to open a D565 file, mmap it and
memcpy the several tens of kilobytes of differences to the framebuffer
than it is to copy those differences from one part of video memory to
another.
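
To put the size reduction in numbers, here is a small sketch that compares the data volume of 26 full-screen frames (the count mentioned elsewhere in this thread) with two full frames plus 300KB of differences. It assumes 1200*900 pixels at 2 bytes each and takes 300KB to mean 300*1024 bytes; payload_bytes is a made-up helper name.

```c
#include <assert.h>

enum { FRAME_BYTES = 1200 * 900 * 2 };   /* one full-screen 16-bit image */

/* Total bytes for `full_frames` full images plus `diff_bytes` of
 * differences.  Hypothetical helper, for illustration only. */
static long payload_bytes(int full_frames, long diff_bytes)
{
    assert(full_frames >= 0 && diff_bytes >= 0);
    return (long)full_frames * FRAME_BYTES + diff_bytes;
}
```

payload_bytes(26, 0) is 56,160,000 bytes, while payload_bytes(2, 300L * 1024) is 4,627,200 bytes, so the diff-based layout needs roughly a twelfth of the data.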

This is where benchmarking should give some clearer answers.

yours,
Bobby


Re: Opportunity for speedup

2009-02-19 Thread Bobby Powers
2009/2/19 Wade Brainerd wad...@gmail.com:
 On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian csc...@laptop.org wrote:

 I'd suggest just uncompressing the various image files and re-timing
 as a start.  The initial implementation was uncompressed, but people
 complained about space usage on the emulator images (which are
 uncompressed).  The current code supports both uncompressed and
 compressed image formats.  For uncompressed images, putting the bits
 on the screen is an mmap and memcpy, so I can't imagine any
 implementation being faster than that (it's possible, of course, that
 what's stealing CPU is the shell's invocation of the client program;
 recoding just that little part in C should be trivial, since it does
 nothing but write to a socket IIRC.)

 I implemented a RLE compressor specifically for these 16bit image files the
 last time this question came up.  This can certainly be faster than memcpy
 since we are talking memory performance.

Can you explain this?  I don't think I have enough knowledge to
evaluate your claim.

bobby


Re: Opportunity for speedup

2009-02-19 Thread Wade Brainerd
RLE (run length encoding) compresses sequences of identical pixels (runs)
as count/value pairs.
So abbccc would be stored as 1a 2b 3c.

The decompressor looks like:

while (cur < end)
{
   unsigned short count = *cur++;   /* run length */
   unsigned short value = *cur++;   /* pixel value to repeat */
   while (count--)
      *dest++ = value;
}

This can be faster than memcpy because you are reading significantly less
memory than you would with memcpy, thus fewer cache misses are incurred.

Because the startup images are mostly spans of solid colors, this kind of
compression works very well.  If that were not the case, say if there were a
left-to-right gradient in the background, RLE would probably make things
worse, so you have to be careful when choosing it.

But the smaller size on disk and in memory would probably improve
performance in other ways as well.
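
For anyone who wants to experiment, here is a minimal round-trip sketch of the scheme described above, using 16-bit count/value pairs. It is not the code from rle.c or unrle.c; the function names and the bounds-checking behaviour are assumptions made for illustration, and it assumes unsigned short is 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>

/* Encode n pixels as (count, value) pairs of unsigned shorts.
 * Returns the number of shorts written, or 0 if the output buffer is
 * too small (or the input is empty). */
static size_t rle_encode(const unsigned short *src, size_t n,
                         unsigned short *out, size_t out_cap)
{
    size_t w = 0, i = 0;
    while (i < n) {
        unsigned short value = src[i];
        unsigned short count = 0;
        /* Extend the run while the pixel repeats and the count fits. */
        while (i < n && src[i] == value && count < 0xFFFF) {
            count++;
            i++;
        }
        if (w + 2 > out_cap)
            return 0;
        out[w++] = count;
        out[w++] = value;
    }
    return w;
}

/* Decode n_shorts of (count, value) pairs into dest.
 * Returns the number of pixels written; stops early rather than
 * writing past dest_cap. */
static size_t rle_decode(const unsigned short *cur, size_t n_shorts,
                         unsigned short *dest, size_t dest_cap)
{
    assert(n_shorts % 2 == 0);
    const unsigned short *end = cur + n_shorts;
    size_t written = 0;
    while (cur < end) {
        unsigned short count = *cur++;
        unsigned short value = *cur++;
        if (count > dest_cap - written)
            return written;
        while (count--)
            dest[written++] = value;
    }
    return written;
}
```

A six-pixel input a b b c c c encodes to the pairs (1,a) (2,b) (3,c), which is the same size as the input. A four-pixel input with no repeats encodes to eight shorts, twice the original, which is the gradient problem mentioned above.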

Best,
Wade





Re: Opportunity for speedup

2009-02-19 Thread Wade Brainerd
Oh, and you can feed one of the 565 files through my 'rle.c' program to see
the compression ratio firsthand.

On Thu, Feb 19, 2009 at 1:56 PM, Wade Brainerd wad...@gmail.com wrote:

 RLE (run length encoding) compresses sequences of identical pixels (runs)
 as value/count pairs.
 So abbccc would be stored as 1a 2b 3c.

 The decompressor looks like:

 while (cur < end)
 {
     unsigned short count = *cur++;
     unsigned short value = *cur++;
     while (count--)
         *dest++ = value;
 }

 This can be faster than memcpy because you are reading significantly less
 memory than you would with memcpy, thus fewer cache misses are incurred.

 Because the startup images are mostly spans of solid color, this kind of
 compression works very well.  If that were not the case, say if there were a
 left-to-right gradient in the background, RLE would probably make things
 worse, so you have to be careful when choosing it.

 But the smaller size on disk and in memory would probably improve
 performance in other ways as well.

 Best,
 Wade


 On Thu, Feb 19, 2009 at 1:49 PM, Bobby Powers bobbypow...@gmail.com wrote:

 2009/2/19 Wade Brainerd wad...@gmail.com:
  On Thu, Feb 19, 2009 at 1:22 PM, C. Scott Ananian csc...@laptop.org
 wrote:
 
  I'd suggest just uncompressing the various image files and re-timing
  as a start.  The initial implementation was uncompressed, but people
  complained about space usage on the emulator images (which are
  uncompressed).  The current code supports both uncompressed and
  compressed image formats.  For uncompressed images, putting the bits
  on the screen is an mmap and memcpy, so I can't imagine any
  implementation being faster than that (it's possible, of course, that
  what's stealing CPU is the shell's invocation of the client program;
  recoding just that little part in C should be trivial, since it does
  nothing but write to a socket IIRC.)
 
  I implemented a RLE compressor specifically for these 16bit image files
 the
  last time this question came up.  This can certainly be faster than
 memcpy
  since we are talking memory performance.

 Can you explain this?  I don't think I have enough knowledge to
 evaluate your claim.

 bobby





Re: Opportunity for speedup

2009-02-19 Thread Bobby Powers
On Thu, Feb 19, 2009 at 1:56 PM, Wade Brainerd wad...@gmail.com wrote:
 RLE (run length encoding) compresses sequences of identical pixels (runs)
 as value/count pairs.
 So abbccc would be stored as 1a 2b 3c.
 The decompressor looks like:
 while (cur < end)
 {
     unsigned short count = *cur++;
     unsigned short value = *cur++;
     while (count--)
         *dest++ = value;
 }
 This can be faster than memcpy because you are reading significantly less
 memory than you would with memcpy, thus fewer cache misses are incurred.
 Because the startup images are mostly spans of solid color, this kind of
 compression works very well.  If that were not the case, say if there were a
 left-to-right gradient in the background, RLE would probably make things
 worse, so you have to be careful when choosing it.
 But the smaller size on disk and in memory would probably improve
 performance in other ways as well.
 Best,
 Wade

thanks, that makes sense


Re: Opportunity for speedup

2009-02-19 Thread Mitch Bradley
Bobby Powers wrote:
 On Thu, Feb 19, 2009 at 1:56 PM, Wade Brainerd wad...@gmail.com wrote:
   
 RLE (run length encoding) compresses sequences of identical pixels (runs)
 as value/count pairs.
 So abbccc would be stored as 1a 2b 3c.
 The decompressor looks like:
 while (cur < end)
 {
     unsigned short count = *cur++;
     unsigned short value = *cur++;
     while (count--)
         *dest++ = value;
 }
 This can be faster than memcpy because you are reading significantly less
 memory than you would with memcpy, thus fewer cache misses are incurred.
 Because the startup images are mostly spans of solid color, this kind of
 compression works very well.  If that were not the case, say if there were a
 left-to-right gradient in the background, RLE would probably make things
 worse, so you have to be careful when choosing it.
 But the smaller size on disk and in memory would probably improve
 performance in other ways as well.
 Best,
 Wade
 

 thanks, that makes sense
   
We are already getting some portion of the possible compression by doing
the iframe-style delta encoding of the second and subsequent frames,
but RLE is still of some use.  It does a good job of shrinking the
first frame, and it halves the size of the delta wad.

The first-frame-shrink could also be accomplished by the trick of 
assuming an initial solid background and representing the first frame as 
a delta from that.

In either case, it looks like rle decoding might be a nice addition, as 
it reduces the size of the frames on disk from 1.2 MB to about 140 KB.
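
To make the solid-background trick concrete, here is a rough sketch.  The
struct span layout and the function names are invented for illustration;
the actual delta format used by the boot animation is not described in this
thread, so treat this as one possible shape rather than the real one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical delta record: write `len` pixels starting at pixel `offset`. */
struct span {
    uint32_t offset;
    uint32_t len;
    const uint16_t *pixels;
};

/* Paint the whole frame with one 16-bit colour. */
void fill_solid(uint16_t *fb, size_t npixels, uint16_t color)
{
    for (size_t i = 0; i < npixels; i++)
        fb[i] = color;
}

/* Overwrite only the regions named by the spans. */
void apply_delta(uint16_t *fb, size_t npixels,
                 const struct span *spans, size_t nspans)
{
    for (size_t s = 0; s < nspans; s++) {
        /* Refuse spans that would run past the end of the frame. */
        assert((size_t)spans[s].offset + spans[s].len <= npixels);
        memcpy(fb + spans[s].offset, spans[s].pixels,
               spans[s].len * sizeof(uint16_t));
    }
}

/* First frame = solid background + a delta, so no full image is stored. */
void draw_first_frame(uint16_t *fb, size_t npixels, uint16_t background,
                      const struct span *spans, size_t nspans)
{
    fill_solid(fb, npixels, background);
    apply_delta(fb, npixels, spans, nspans);
}
```

With this shape the first frame costs only as much storage as the parts
that differ from the background, and later frames can reuse apply_delta
unchanged.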





Re: Opportunity for speedup

2009-02-19 Thread Mitch Bradley
da...@lang.hm wrote:

 if you have the diff of the images, do you need to read from the 
 framebuffer at all? since you know what you put there, and know what 
 you want to change, can't you just write your changed information to 
 the right place?

The framebuffer in this case is serving as persistent shared memory, 
thus avoiding the extra complexity of a client/server architecture to 
maintain the sequencing state.

The extremely-tiny (4K - 1 memory page) client program initially reads 
the first frame into the on-screen framebuf and the delta set into 
off-screen framebuffer memory.  On subsequent invocations, the client 
copies another delta into the on-screen framebuf.

If it is statically linked and uses only direct syscalls, the exec() 
overhead is minimal - no shell process instantiation, no script startup, 
no ld.so invocations, no mapping in shared libraries, no relocation.



Re: Opportunity for speedup

2009-02-19 Thread david
On Thu, 19 Feb 2009, Mitch Bradley wrote:

 da...@lang.hm wrote:
 
 if you have the diff of the images, do you need to read from the 
 framebuffer at all? since you know what you put there, and know what you 
 want to change, can't you just write your changed information to the right 
 place?

 The framebuffer in this case is serving as persistent shared memory, thus 
 avoiding the extra complexity of a client/server architecture to maintain the 
 sequencing state.

 The extremely-tiny (4K - 1 memory page) client program initially reads the 
 first frame into the on-screen framebuf and the delta set into off-screen 
 framebuffer memory.  On subsequent invocations, the client copies another 
 delta into the on-screen framebuf.

 If it is statically linked and uses only direct syscalls, the exec() overhead 
 is minimal - no shell process instantiation, no script startup, no ld.so 
 invocations, no mapping in shared libraries, no relocation.

right, but why read the current framebuffer? you don't touch most of it, 
you aren't going to do anything different based on what's there (you are 
just going to overlay your new info there) so all you really need to do is 
to write the parts that need to change.

David Lang


Re: Opportunity for speedup

2009-02-19 Thread Mitch Bradley
da...@lang.hm wrote:

 right, but why read the current framebuffer? you don't touch most of 
 it, you aren't going to do anything different based on what's there 
 (you are just going to overlay your new info there) so all you really 
 need to do is to write the parts that need to change.

You don't read the on-screen part of the framebuffer.  You copy delta 
data from off-screen framebuffer memory to portions of the on-screen 
framebuffer memory.

On-screen vs. off-screen is irrelevant to the speed - read access to the 
memory that is reserved for display controller use is similarly slow 
in both cases.  But considering that the delta data is small compared to 
the full images, it's worth it to store the deltas there, thus avoiding 
the overhead of the other alternatives for maintaining the context from 
one call to the next.

Those alternatives are:

a) Server process maintains context on behalf of repeatedly-executed 
client process.  This incurs the complexity of client-server 
architectures - setup/teardown, library overhead, interprocess 
communication, scheduling.

b) Client program reads new delta data from a file on each invocation.
This incurs the filesystem overhead of opening a file on each invocation
(in comparison, the off-screen framebuffer solution requires only a
single open() and a single read() on the first invocation).

c) Client program reads delta set into a shared memory segment and then 
reattaches to that segment on subsequent invocations.  This is similar 
to the framebuffer approach except that it uses faster memory for the 
persistent storage.  It might be a win from a speed perspective, but it 
is a bit more complex, requiring the program to deal with two memory 
objects instead of just one.  The total amount of time that it could
possibly save is about 50 ms, since that is the time it takes to read
the delta set from the off-screen framebuffer.  And if we use the RLE
encoding suggested by Wade, the amount of off-screen data is halved, so
the best-case savings are reduced to 25 ms total.




Re: Opportunity for speedup

2009-02-19 Thread david
On Thu, 19 Feb 2009, Mitch Bradley wrote:

 da...@lang.hm wrote:
 
 right, but why read the current framebuffer? you don't touch most of it, 
 you aren't going to do anything different based on what's there (you are 
 just going to overlay your new info there) so all you really need to do is 
 to write the parts that need to change.

 You don't read the on-screen part of the framebuffer.  You copy delta data 
 from off-screen framebuffer memory to portions of the on-screen framebuffer 
 memory.

 On-screen vs. off-screen is irrelevant to the speed - read access to the 
 memory that is reserved for display controller use is similarly slow in 
 both cases.  But considering that the delta data is small compared to the 
 full images, it's worth it to store the deltas there, thus avoiding the 
 overhead of the other alternatives for maintaining the context from one call 
 to the next.

 Those alternatives are:

 a) Server process maintains context on behalf of repeatedly-executed client 
 process.  This incurs the complexity of client-server architectures - 
 setup/teardown, library overhead, interprocess communication, scheduling.

 b) Client program reads new delta data from a file on each invocation.  This
 incurs the filesystem overhead of opening a file on each invocation (in
 comparison, the off-screen framebuffer solution requires only a single open()
 and a single read() on the first invocation).

 c) Client program reads delta set into a shared memory segment and then 
 reattaches to that segment on subsequent invocations.  This is similar to the 
 framebuffer approach except that it uses faster memory for the persistent 
 storage.  It might be a win from a speed perspective, but it is a bit more 
 complex, requiring the program to deal with two memory objects instead of 
 just one.  The total amount of time that it could possibly save is about 50
 ms, since that is the time it takes to read the delta set from the off-screen
 framebuffer.  And if we use the RLE encoding suggested by Wade, the amount of
 off-screen data is halved, so the best-case savings are reduced to 25 ms
 total.

d) compile the delta set into the client program.

does this really need to be a general-purpose solution here? or is this
really only used for this specific purpose?

David Lang


Re: Opportunity for speedup

2009-02-19 Thread Mitch Bradley
da...@lang.hm wrote:

 d) compile the delta set into the client program.

That works, but

1) It requires more work from the VM system on each invocation of the 
client program, which is now 1.x MB instead of 4K.
2) If a deployment wants to change the image set, it needs a compiler 
toolchain instead of a (small) delta-encoding program.

Speed-wise, (d) might be a wash, or perhaps even a slight win.  It 
depends on how efficient the VM system is, and the effectiveness of the 
filesystem buffer cache at preventing re-reads of the client process 
image (paging directly from JFFS2 is not possible).

The framebuffer hack avoids numerous assumptions about the effectiveness 
of clever but complex subsystems (e.g. the VM system, the filesystem 
buffer cache, the shared library mechanisms, zlib, JFFS2 compression, ...).



Re: Opportunity for speedup

2009-02-18 Thread Bobby Powers
On Wed, Feb 11, 2009 at 2:01 AM, Mitch Bradley w...@laptop.org wrote:
 I just measured the time taken by the boot animation by the simple
 technique of renaming /usr/bin/rhgb-client so the initscripts can't find it.

how did you measure exactly? stopwatch? I'd like to recreate the
tests.  It sounds like you did this on a freshly flashed system?

 With boot animation, OS build 7 (an older 8.2.1 candidate) takes 60
 seconds from first dot (indicating OFW transfer to Linux) to Sugar
 prompt for your name.   Without it, 53 seconds.  I repeated the test
 several times with consistent results.

 Clearly, it should be possible to display that amount of information in
 much less than 7 seconds.

 The boot animation code is in the OLPC domain, not the upstream domain,
 so replacing it should be relatively free of upstream politics.

 So if anybody is interested in implementing a relatively simple
 boot-time speedup, I offer this as low-hanging fruit.

 I suggest 1 second (differential time between animation and no-animation
 cases) as a reasonable target goal, assuming images of the complexity of
 the current ones.  Arbitrary full-screen graphics might require more
 time, but speeding up the baseline case is a good starting point.

 Go wild.

So I've taken a first cut at this, implemented with the following
design considerations (mostly from a conversation with Mitch)
- the Python client/server was reimplemented as several standalone C
programs (boot-anim-start, boot-anim-client, and some cleanup in
boot-anim-stop)
- a client and server were used before because there is state
information that needs to be saved: we need to keep track of where in
the animation we are.  We can keep track of this by using offscreen
memory in the framebuffer (it's 16MB in size, and only the first 2ish
MB is used for the onscreen graphics; my terminology might be off
here).  For state we really only need to keep track of 2 integers,
one for the current frame number and another to store the offset of
the next diff to apply.
- on startup we load an initial image into the framebuffer (the first
1200*900*2 bytes, since we use 2 bytes per pixel for color
information), and then load in a series of changes to the framebuffer
image (300KB).  This takes the form of a series of diffs.
- for each update (a valid call to boot-anim-client) we apply the next
diff in the series to the onscreen image and update our state
information
- after applying the last diff we have (the end of the animation
series), freeze the DCON (when I first attempted to freeze the DCON
when z-boot-anim-stop was called it left the screen in an inconsistent
state, I believe because of X startup)
- it's designed to be as light as possible, using syscalls instead of
libc functions as much as possible (the only thing we use libc for is
string comparison, which could be replaced with a local function).
While it's written like this, I haven't worked on cutting down the
linking (I need some guidance for that)
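
To illustrate the state handling described in the list above, here is a
simplified sketch of one client invocation.  It works on a plain byte
buffer standing in for the mmap'ed framebuffer.  The struct anim_state
layout, the diff record format (pixel offset, pixel count, then the pixels,
one record per frame) and the name anim_step are all assumptions made for
the example, not the layout the bootanim source actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On-screen image size from the description: 1200*900 pixels, 2 bytes each. */
enum { ONSCREEN_BYTES = 1200 * 900 * 2 };   /* 2,160,000 bytes */

/* The two integers of state, kept in off-screen framebuffer memory. */
struct anim_state {
    uint32_t frame;     /* number of diffs applied so far */
    uint32_t next_off;  /* byte offset of the next diff record */
};

/* Apply the next diff and advance the state stored at fb + state_off.
 * Returns 1 if a diff was applied, 0 if the animation is finished (the
 * point where the real client would freeze the DCON). */
int anim_step(uint8_t *fb, size_t state_off, uint32_t nframes)
{
    struct anim_state st;
    uint32_t hdr[2];    /* hdr[0] = pixel offset, hdr[1] = pixel count */

    /* memcpy avoids unaligned access into the byte buffer. */
    memcpy(&st, fb + state_off, sizeof st);
    if (st.frame >= nframes)
        return 0;

    memcpy(hdr, fb + st.next_off, sizeof hdr);
    /* Copy the diff's pixels from off-screen memory to the on-screen area. */
    memcpy(fb + (size_t)hdr[0] * 2,
           fb + st.next_off + sizeof hdr,
           (size_t)hdr[1] * 2);

    st.frame++;
    st.next_off += (uint32_t)sizeof hdr + hdr[1] * 2u;
    memcpy(fb + state_off, &st, sizeof st);
    return 1;
}
```

In the real programs the buffer would come from mapping the framebuffer
device, with the state placed somewhere past the first ONSCREEN_BYTES
bytes; each run of the client then needs no other persistent storage.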

comments and suggestions welcome :)

I'd appreciate any testing as well as any code review.  (the shutdown
image appears to be broken, FYI.  I haven't looked at that in depth,
it's probably a one line fix.)
rpms (built with mock) are available at
http://dev.laptop.org/~bobbyp/bootanim/
and source is avail at
http://dev.laptop.org/git?p=users/bobbyp/bootanim

-Bobby