[EMAIL PROTECTED] writes:
> massive improvements in performance (30% the memory, 60% the time),
> now possible to run full perceptron on boxes with 512MB of RAM
Okay, official numbers: 25% the memory and 83% the time.
I actually had to find a box with 2GB of RAM to even run the test (and
also to check that the results were the same on a really big data set).
Running the nobayes-net logs (851285 messages total) for set1 scores:
                       VSZ (KB)  RSS (KB)  wall (s)  user (s)  system (s)
original code           1038664   1010436    316.13    306.09        9.98
no compression (CSV)     440944    426144    245.89    237.17        8.73
with compression         267752    252036    261.44    253.13        8.31
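
For anyone who wants to check the "25% the memory and 83% the time" summary against the table, here is a quick sketch. The figures are copied from the table; the %run hash and percent_of_original() are just names made up for this illustration.

```perl
use strict;
use warnings;

# Figures copied from the table above: [VSZ, RSS, wall].
my %run = (
    original   => [1038664, 1010436, 316.13],
    csv        => [ 440944,  426144, 245.89],
    compressed => [ 267752,  252036, 261.44],
);

# Express one measurement as a percentage of the original code's value.
# $column is 0 for VSZ, 1 for RSS, 2 for wall-clock time.
sub percent_of_original {
    my ($name, $column) = @_;
    return 100 * $run{$name}[$column] / $run{original}[$column];
}

printf "%-10s  RSS %.1f%%  wall %.1f%%\n",
    $_, percent_of_original($_, 1), percent_of_original($_, 2)
    for qw(csv compressed);
```

The compressed run comes out at roughly a quarter of the original RSS and a bit over four fifths of the original wall-clock time.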
Main improvements if you were curious:
- the biggie: replacing the large hash of arrayrefs (with tests as the
  array elements) with an array of packed integers, plus freeze/thaw
  functions for converting between packed integers and test lists (the
  hash was 100% unnecessary, as the keys were sequential integers
  starting at zero)
- revamped readlogs() based on the improved readlogs() in
  hit-frequencies (Justin had the additional suggestion to use handler
  functions, which worked well)
- switching $is_spam{$index} to a bit vector (Justin's idea) saved about
  40 MB as well (the hash used ~41 MB and an array would have used
  ~16 MB, vs. 106 KB for the vector!)
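
The vector idea in the last item can be sketched with Perl's built-in vec(). This is a minimal illustration, not the actual perceptron code: $spam_bits, set_spam() and is_spam() are hypothetical names, and one bit per message is an assumption (it does fit the numbers, since 851285 bits round up to 106411 bytes).

```perl
use strict;
use warnings;

# One bit per message index, stored in a plain string.
# (Hypothetical name; the real code may organise this differently.)
my $spam_bits = '';

# Record whether the message at $index is spam.
sub set_spam {
    my ($index, $flag) = @_;
    vec($spam_bits, $index, 1) = $flag ? 1 : 0;
}

# Return 1 if the message at $index was marked as spam, else 0.
sub is_spam {
    my ($index) = @_;
    return vec($spam_bits, $index, 1);
}

set_spam(0, 1);          # first message is spam
set_spam(851284, 0);     # last message is ham; this also grows the string
printf "%d bytes for 851285 messages\n", length($spam_bits);
```

Assigning through vec() extends the string as needed, so the storage never exceeds one bit per message plus padding to a whole byte.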
The CSV version did the same thing, except with join(',', @tests) for
the array elements. That was still a huge improvement over the arrayrefs
and is the fastest code, but the additional memory savings from packing
are an easy tradeoff to make.
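
To make the two encodings concrete, here is a rough sketch of what a freeze/thaw pair could look like next to the join(',') variant. The function names and the 'w*' (BER compressed integer) pack template are assumptions made for the example; the actual code may pack the test indices differently.

```perl
use strict;
use warnings;

# Packed variant: each test index becomes a BER compressed integer.
# The 'w*' template is an assumption, chosen only for illustration.
sub freeze { return pack('w*', @_); }
sub thaw   { return unpack('w*', $_[0]); }

# Comma-separated variant, as in join(',', @tests).
sub freeze_csv { return join(',', @_); }
sub thaw_csv   { return split(/,/, $_[0]); }

# A plain array indexed by message number replaces the hash of arrayrefs.
my @tests_hit;
$tests_hit[0] = freeze(3, 17, 42, 250);

my @tests = thaw($tests_hit[0]);
printf "packed: %d bytes, csv: %d bytes, tests: @tests\n",
    length($tests_hit[0]), length(freeze_csv(@tests));
```

Either way each message costs a single scalar instead of an arrayref with one scalar per test; packing just makes that scalar shorter, at the price of a pack/unpack call on each access.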
I think there's more speed to squeeze out, but it's not as much of an
issue as the memory usage was.
I suppose that explains why I was unable to run it on my poor 512MB
machine. :-)
Daniel
--
Daniel Quinlan
http://www.pathname.com/~quinlan/