logs-to-c

Daniel Quinlan 3 Dec 2004 11:35:01 -0000

[EMAIL PROTECTED] writes:

> massive improvements in performance (30% the memory, 60% the time),
>   now possible to run full perceptron on boxes with 512MB of RAM


Okay, official numbers: 25% the memory and 83% the time.

I actually had to find a box with 2GB of RAM to even run the test (and
also to check that the results were the same on a really big data set).
Running the nobayes-net logs (851285 messages total) for set1 scores:

                             VSZ     RSS    wall    user  system
  original code          1038664 1010436  316.13  306.09    9.98
  no compression (CSV)    440944  426144  245.89  237.17    8.73
  with compression        267752  252036  261.44  253.13    8.31

Main improvements if you were curious:

 - the biggie: replacing the large hash of arrayrefs (with tests as the
   array elements) with an array of packed integers, plus a freeze/thaw
   function for converting between packed integers and test lists (the
   hash was 100% unnecessary as the keys were sequential integers
   starting at zero)

 - revamped readlogs() based on the improved readlogs() in
   hit-frequencies (Justin had the additional suggestion to use handler
   functions which worked well)

 - switching $is_spam{$index} to be a vector (Justin's idea) saved about
   40 MB as well (using an array instead of the hash would only have
   reduced usage from ~41MB to ~16MB, vs. 106 KB for the vector!)
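
To illustrate the vector idea in the last item, here is a minimal sketch
using Perl's built-in vec().  It is not the actual logs-to-c code: the
helper names set_spam() and is_spam() are hypothetical, and only the
message count is taken from the run described above.

```perl
use strict;
use warnings;

# Message count quoted above; everything else here is illustrative.
my $num_messages = 851285;

# One bit per message, all packed into a single string.
my $is_spam = '';

# Mark message $index as spam (1) or ham (0).  (Hypothetical helper.)
sub set_spam {
    my ($index, $flag) = @_;
    vec($is_spam, $index, 1) = $flag ? 1 : 0;
}

# Return 1 if message $index is marked spam, 0 otherwise.
sub is_spam {
    my ($index) = @_;
    return vec($is_spam, $index, 1);
}

set_spam(0, 1);
set_spam($num_messages - 1, 1);    # touching the last bit sizes the string

printf "bytes used: %d\n", length $is_spam;
```

Setting the highest index grows the string to 106411 bytes, one bit per
message, which is in line with the ~106 KB figure above.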

The CVS version did the same thing, except with a join(',', @tests) for
the array elements.  That was still a huge improvement over the
arrayrefs, and it is the fastest code (the "no compression (CSV)" row
above), but the additional memory savings are an easy tradeoff to make.
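
A rough sketch of the two representations, for comparison.  This is not
the real code: freeze_tests() and thaw_tests() are hypothetical names,
the rule names are made up, and the 'w*' template (BER compressed
integers) is only one plausible packing; the format actually used is not
stated here.

```perl
use strict;
use warnings;

# Hypothetical: each test (rule) name maps to a small integer ID.
my @test_names = qw(RULE_A RULE_B RULE_C);    # illustrative names only
my %test_id;
@test_id{@test_names} = (0 .. $#test_names);

# "Freeze": turn a list of test names into one packed string.
sub freeze_tests {
    my @names = @_;
    return pack 'w*', map { $test_id{$_} } @names;
}

# "Thaw": turn a packed string back into the list of test names.
sub thaw_tests {
    my ($packed) = @_;
    return map { $test_names[$_] } unpack 'w*', $packed;
}

# Per-message store: a plain array indexed by message number,
# rather than a hash keyed by sequential integers.
my @tests_hit;
$tests_hit[0] = freeze_tests(qw(RULE_A RULE_C));
$tests_hit[1] = freeze_tests(qw(RULE_B));

# The join(',') alternative keeps the same IDs as text instead.
my $csv = join ',', map { $test_id{$_} } qw(RULE_A RULE_C);

print join(' ', thaw_tests($tests_hit[0])), "\n";
```

The packed form costs a pack/unpack on every access, which fits the
pattern in the table: less memory than the comma-joined strings, slightly
more time.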

I think there's more speed to squeeze out, but it's not as much of an
issue as the memory usage was.

I suppose that explains why I was unable to run it on my poor 512MB
machine.  :-)

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/
