Imagine my surprise when it took RB a minute to do what perl did in 15
seconds. I'm not surprised that perl is faster, only HOW much faster it
is.
My argument was going to be, "see, difference wasn't that much, and RB is
much easier to read," but these results make it no contest.
So my question is: Why is RB so much slower? What is it doing that's making
the difference? And could I do something to close the gap?
Regarding your first implementation:
* You're reading/writing data repeatedly in small chunks. Horrible use of
disk I/O. Every read is followed by a write, which means every loop involves
two head seeks, and the reads/writes are much smaller than optimal for a
burst. Disk caching may alleviate some of this, but at best it's just
preventing you from seeing the full effects of using disk I/O in this
manner. This is why your second implementation was more than 2x faster and
almost as fast as Perl.
* I'm not sure, but I would bet that ( "case " + line ) and ( "r = """ +
line + """" + chr( 13 ) ) allocate new strings before writing to disk. (I
doubt the compiler is optimized to recognize what's happening and call Write
repeatedly or, better yet, call a version that accepts an array of strings
to write in order.) You want to minimize memory allocations and copying in
any high speed data processing. RAM is actually slow from the CPU's
perspective. This is tough in RB for several reasons.
* ReplaceAllB( line, """", """""" ) forces a string allocation/copy.
* chr( 13 ) is a wasted function call that forces yet another memory
allocation.
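
To make the points above concrete, here is a minimal sketch in Python (not RB, and not your code). It assumes the job is to emit r = "<line>" followed by a CR for each input line, which I'm guessing from the fragments you posted. The function name emit_assignments and the flush_bytes threshold are made up for illustration. The pieces are queued and handed to the file in large batches instead of being concatenated and written one line at a time, and the CR is built once.

```python
import io

CR = b"\r"  # built once; the analogue of not calling chr(13) on every line


def emit_assignments(lines, dst, flush_bytes=1 << 20):
    """Write r = "<line>" + CR for each line, batching output into large writes.

    `lines` is any iterable of bytes objects without line terminators.
    Returns the number of write batches issued to `dst`.
    """
    pending = []        # pieces waiting to be written; never concatenated here
    pending_size = 0
    batches = 0
    for line in lines:
        escaped = line.replace(b'"', b'""')   # still one copy per line
        pending += (b'r = "', escaped, b'"', CR)
        pending_size += len(escaped) + 7      # 5 bytes prefix + quote + CR
        if pending_size >= flush_bytes:
            dst.writelines(pending)           # pieces handed over in order
            pending.clear()
            pending_size = 0
            batches += 1
    if pending:
        dst.writelines(pending)
        batches += 1
    return batches
```

Note that the replace call still copies each line once, so this only removes the concatenation copies and the small writes, not every allocation.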
You didn't post your code for the second implementation, but from your
description: the speed boost came from using disk I/O in a much more
efficient manner, but you still copied the 100 MB of data twice, once to the
array, and once to the output string. Worse, split had to allocate a
separate memory block (separate OS call and related overhead tracking) for
every line, unless internally split is doing something creative like
building an array of pointers to different offsets in the original string
memory block. I doubt this, but it's a possible optimization given immutable
strings.
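
The "array of pointers to offsets" idea can be sketched like this, again in Python rather than RB. I have no idea whether RB's split does anything of the kind; line_spans is a hypothetical helper, and the CR separator is an assumption. It records where each line starts and ends in the original buffer and copies no line data until a caller asks for it.

```python
def line_spans(buf, sep=b"\r"):
    """Yield (start, end) offsets of each line in buf without copying line data."""
    start = 0
    while True:
        end = buf.find(sep, start)
        if end == -1:
            break
        yield start, end
        start = end + len(sep)
    if start < len(buf):
        yield start, len(buf)   # trailing line with no terminator


if __name__ == "__main__":
    data = b"first\rsecond\rthird"
    view = memoryview(data)     # slicing a memoryview does not copy the bytes
    for s, e in line_spans(data):
        print(s, e, bytes(view[s:e]))
```

The only per-line cost is a pair of integers, so a 100 MB input needs one large block plus the offset list instead of one allocation per line.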
I imagine that some lines of code were forcing additional string
allocations/copies going from the array to the final output as well. For
example, anything like (myline + chr(13)) would allocate a new block and
copy myline to it. So the data probably ended up copied around 3 or 4x with
all the related memory allocations. Given all of that, it's a testament to
split and RB that you got close to Perl's speed!
Perl is not magic, it's just engineered for the task at hand, efficient
string processing. RB is designed for the task of rapid production of
maintainable code.
It would be easy to hand Perl its lunch using C for this example. I'm not
sure you can do it in RB because the language lacks the structures and
compiler optimizations necessary to efficiently treat and manipulate a block
of memory as an array of values.
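
For what "treating a block of memory as an array of values" buys you, here is a rough sketch of the one-allocation approach, written in Python only because it is easy to check; a C version would follow the same shape with raw pointers. The name escape_into_buffer is made up, and it assumes the only transformation needed is doubling the quote characters. The output size is computed up front, the output block is allocated once, and the input is copied into it exactly once.

```python
def escape_into_buffer(data):
    """Return a copy of data with every double quote doubled, using one output block."""
    quotes = data.count(b'"')
    out = bytearray(len(data) + quotes)   # the single output allocation
    src = memoryview(data)                # slices of a memoryview do not copy
    pos = 0                               # read position in data
    w = 0                                 # write position in out
    while True:
        q = data.find(b'"', pos)
        if q == -1:
            break
        n = q - pos
        out[w:w + n] = src[pos:q]         # copy the run before the quote
        w += n
        out[w:w + 2] = b'""'              # emit the doubled quote
        w += 2
        pos = q + 1
    n = len(data) - pos
    out[w:w + n] = src[pos:]              # copy whatever follows the last quote
    return out
```

Compare that with the 3 or 4 copies estimated above: here the data moves once, and the allocator is called once no matter how many lines there are.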
Those wishing for faster string processing aren't going to get it
automatically. However, I do wish RB would give us the language structures
and compiler optimizations so that it's possible to output carefully
engineered RB code that can hang with C on data processing.
Daniel L. Taylor
Taylor Design
Computer Consulting & Software Development
[EMAIL PROTECTED]
www.taylor-design.com