Also blogged here:
http://headius.blogspot.com/2007/10/another-performance-discovery-rexml.html
I've discovered a really awful bottleneck in REXML processing.
Look at these results for parsing our build.xml:
read content from stream, no DOM
(columns: user, system, total, and real time, in seconds)
2.592000 0.000000 2.592000 ( 2.592000)
1.326000 0.000000 1.326000 ( 1.326000)
0.853000 0.000000 0.853000 ( 0.853000)
0.620000 0.000000 0.620000 ( 0.620000)
0.471000 0.000000 0.471000 ( 0.471000)
read content once, no DOM
5.323000 0.000000 5.323000 ( 5.323000)
5.328000 0.000000 5.328000 ( 5.328000)
5.209000 0.000000 5.209000 ( 5.209000)
5.173000 0.000000 5.173000 ( 5.173000)
5.138000 0.000000 5.138000 ( 5.138000)
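For context, here's roughly the kind of harness that produces numbers
like the above (the no-op listener and the five iterations are
illustrative stand-ins, not the exact script):

  require 'benchmark'
  require 'rexml/document'
  require 'rexml/streamlistener'

  # A listener that ignores every event, so only parse time is measured
  class NullListener
    include REXML::StreamListener
  end

  # "read content from stream, no DOM": the parser pulls chunks from the IO
  5.times do
    File.open('build.xml') do |io|
      puts Benchmark.measure { REXML::Document.parse_stream(io, NullListener.new) }
    end
  end

  # "read content once, no DOM": the parser matches against one big String
  xml = File.read('build.xml')
  5.times do
    puts Benchmark.measure { REXML::Document.parse_stream(xml, NullListener.new) }
  end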
When reading from a stream, the content is read in chunks, and each
chunk is matched (and therefore encoded/decoded) in turn.
However, when a fully-read string is used in memory, matching proceeds
as follows:
1. set the buffer to the entire string
2. match against the buffer
3. set the buffer to the post-match remainder
Now this is obviously a little inefficient, but copy-on-write Strings
help a lot in MRI. In JRuby's case, however, this means we encode/decode
the entire remaining XML content for every element match. For any
nontrivial file, this is *terrible* overhead.
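Roughly, that loop looks like this (a simplified sketch of the pattern;
the token regexp is a placeholder, not REXML's real one):

  # Match-and-slice over an in-memory String
  NEXT_TOKEN = /\A[^<]*<[^>]*>/

  buffer = File.read('build.xml')   # 1. set buffer to the entire string
  until buffer.empty?
    md = NEXT_TOKEN.match(buffer)   # 2. match against the whole buffer
    break unless md
    # ... hand md[0] off to the tokenizer ...
    buffer = md.post_match          # 3. set buffer to the post-match remainder
  end

In C Ruby, md.post_match can share the underlying string data, which is
why copy-on-write keeps step 3 cheap there; in JRuby every match call
pays the encode/decode cost over the whole remaining buffer.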
So what's the fix? Here's the same second benchmark again, this time
passing a StringIO object to the parser (a sketch of the change follows
the results).
read content once, no DOM
0.640000 0.000000 0.640000 ( 0.640000)
0.693000 0.000000 0.693000 ( 0.693000)
0.542000 0.000000 0.542000 ( 0.542000)
0.349000 0.000000 0.349000 ( 0.349000)
0.336000 0.000000 0.336000 ( 0.336000)
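The change itself is a one-liner; a minimal sketch, with the same kind
of no-op listener as before:

  require 'stringio'
  require 'rexml/document'
  require 'rexml/streamlistener'

  class NullListener
    include REXML::StreamListener
  end

  xml = File.read('build.xml')
  # Wrapping the String in a StringIO sends REXML down its chunked
  # IO-reading path instead of re-matching the entire remaining String
  REXML::Document.parse_stream(StringIO.new(xml), NullListener.new)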
This is a perfect indication of why JRuby's Rails performance is nowhere
near where it could be. Of course, the original code would perform fine
once our Oniguruma port is complete, but this is a simple change to make
for now.
- Charlie