> compared to a current CVS version of:
> 5000 generations in 90.813806 seconds. 55.057708 generations/sec
> A total of 32768 bytes were allocated
> A total of 130932 DOD runs were made
> A total of 10930 collection runs were made
> Copying a total of 57801936 bytes
>
> so a 97% decrease in the number of collection runs only turns into an 11%
> improvement in total performance.
This prompted me to set up Parrot for profiling in VTune here on my
machine. What follows is a rather longish email, for which I
apologize. Much of the length is statistics, which I'm including to allow
people to draw their own conclusions. I know I'm not the only one who's
profiled Parrot, but I've never really seen any Parrot profiling results
posted to the list. If anyone has a benchmark they'd like me to run for
them, I'm willing to do that. I think Brent might have wanted this before,
back before I had this set up.
In decreasing order of call count, the most frequently called functions:
Total Count Function
40500000 singlebyte_skip_forward
20250000 singlebyte_decode
18000000 Parrot_add_i_i_i
11260238 add_header_to_free
11260073 Parrot_allocate
11260071 new_string_header
11245053 Parrot_string_make
10130030 Parrot_string_length
10130003 Parrot_set_i_ic
10125000 Parrot_substr_s_s_i_ic
10125000 Parrot_string_substr
10125000 Parrot_string_compare
9000000 Parrot_mod_i_i_i
9000000 Parrot_ne_s_sc_ic
2189577 Parrot_branch_ic
(yes, we perform the extremely simple singlebyte_skip_forward 40 million
times on a 5000 generation life.pasm)
In order of total time spent in the function ALONE, we have:
Time Function
3004 runops_fast_core
2182 Parrot_string_substr
2039 singlebyte_skip_forward
2018 Parrot_string_compare
1212 Parrot_string_make
1125 Parrot_add_i_i_i
1088 Parrot_mod_i_i_i
925 new_string_header
904 Parrot_substr_s_s_i_ic
836 Parrot_ne_s_sc_ic
828 add_header_to_free
737 singlebyte_decode
688 Parrot_allocate
557 Parrot_set_i_ic
534 mark_PMCs_unused
453 free_unused_buffers
440 Parrot_string_concat
393 Parrot_string_length
354 mark_buffers_unused
202 Parrot_concat_s_sc
Finally, for total time spent in a function and its called functions, we
have:
Total Time Function
9469 Parrot_substr_s_s_i_ic
8564 Parrot_string_substr
5842 Parrot_string_make
4145 Parrot_ne_s_sc_ic
3679 new_string_header
3668 Parrot_string_compare
2753 alloc_more_string_headers
2716 Parrot_do_dod_run
2039 singlebyte_skip_forward
1628 Parrot_concat_s_sc
1425 Parrot_string_concat
1281 free_unused_buffers
1125 Parrot_add_i_i_i
1088 Parrot_mod_i_i_i
956 Parrot_allocate
828 add_header_to_free
737 singlebyte_decode
557 Parrot_set_i_ic
While life normally took 7 seconds to execute, performing a 'call graph
profiling session' took around 287 seconds. The numbers you see above are
in milliseconds, and are obviously inflated as compared to an unprofiled
execution. The profiler lists the total time as 21873 milliseconds.
Given the amount of string manipulation that life.pasm performs, string
handling is obviously the bottleneck here. It's interesting to note that
GC's DOD and collection runs show up, but don't really rate that high,
which corresponds to Peter's findings.
The most expensive functions, in terms of time spent within the function
itself, are the ones that are called most often, regardless of how simple
they are. I wasn't sure how much of this was due to the profiling overhead
involved in tracking function calls, and how much was indicative of real
performance problems.
To test this, since this example only uses singlebyte encoding, I
inlined the contents of the encoding functions (singlebyte_skip_forward
and singlebyte_decode) and tried it again.
Using encodings for string_substr and string_compare:
5000 generations in 7.071001 seconds. 707.113457 generations/sec
Using hardcoded singlebyte-ness for them:
5000 generations in 6.348999 seconds. 787.525716 generations/sec
Or roughly 10%, due to the need to support encodings. This compares well
with the profiler's listing of 12% of the time being spent in those two
encoding functions, so the profile seems pretty accurate as far as where
the time is going.
A few things immediately come to mind:
a) with the current encoding system, we're guaranteed to be slower than we
would be without it. If we expect Parrot to be as fast as Perl 5 while
paying this cost, we're deluding ourselves.
b) the results will probably be worse in more string-heavy applications,
such as regexes, etc.
c) there's gotta be a way to have multiple encodings without sacrificing
that much speed
One approach that came to mind was macros (gah! no! erg!). An equivalent
system would be preprocessing of .c files to produce .c files for the
compiler, via some perl script.
Imagine if we put substring, compare, etc., into the encodings vtable.
This would result in code duplication, since much of the logic would be
repeated per encoding, but it would allow the compiler to heavily optimize
things for fixed-width encodings such as singlebyte and/or utf32. We could
put special tags in our string.c that allow it to be preprocessed into
plugin stubs for the various decode, skip_forward, etc. functions that
differ for each encoding.
A macro implementation would be something like:

#define ENCODING singlebyte
#define DECODE(a) (*(const byte_t *)(a))
#define SKIP_FORWARD(a,b) ((const byte_t *)(a) + (b))
#include "string.c"
#undef ENCODING
#undef DECODE
#undef SKIP_FORWARD
#define ENCODING utf32 .....

where string.c is something like (note that ## won't expand its
arguments, so a two-level paste macro is needed to get
singlebyte_string_compare out of ENCODING):

#define PASTE_(a,b) a##b
#define PASTE(a,b) PASTE_(a,b)

void PASTE(ENCODING,_string_compare)(..) {
    while( ... ) {
        INTVAL c1 = DECODE(s1start);
        INTVAL c2 = DECODE(s2start);
        cmp = c1 - c2;
        if (cmp) break;
        s1start = SKIP_FORWARD(s1start,1);
        s2start = SKIP_FORWARD(s2start,1);
    }
}
Which would compile down into optimized code. Then, when we perform a
comparison, we'd simply delegate to the string's vtable. This would
require transcoding for dual-string functions like compare (ugh), but
would be rather simple for stuff like substr. Alternately, we could make a
2d-array for encoding1,encoding2, and dispatch into that table
appropriately. This causes the number of string function compilations to
go up as N^2, where N is the number of encodings, however. (More
multidispatch in disguise! ;)
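A rough sketch of that pairwise table idea, with hypothetical names and
only the singlebyte/singlebyte cell filled in (the other cells are where
the N^2 specialized compilations would go):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byte_t;

enum encoding { ENC_SINGLEBYTE, ENC_UTF32, NUM_ENCODINGS };

/* One specialized comparison per (encoding1, encoding2) pair; the
 * bodies would come out of the preprocessed string.c template. */
typedef int (*compare_fn)(const void *s1, const void *s2, size_t chars);

static int sb_sb_compare(const void *s1, const void *s2, size_t chars)
{
    const byte_t *a = s1, *b = s2;
    while (chars--) {
        int cmp = *a++ - *b++;
        if (cmp)
            return cmp;
    }
    return 0;
}

/* N^2 entries: this is where the compile-time cost shows up as the
 * number of encodings grows. */
static compare_fn compare_table[NUM_ENCODINGS][NUM_ENCODINGS] = {
    [ENC_SINGLEBYTE][ENC_SINGLEBYTE] = sb_sb_compare,
    /* [ENC_SINGLEBYTE][ENC_UTF32] = sb_utf32_compare, ... */
};

/* Dispatch once per string operation, not once per character. */
static int string_compare(int enc1, const void *s1,
                          int enc2, const void *s2, size_t chars)
{
    return compare_table[enc1][enc2](s1, s2, chars);
}
```

The win over the current scheme is that the per-character DECODE and
SKIP_FORWARD work sits inside the specialized function, so the indirect
call happens once per operation rather than once per character.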
Re-running the profiling, with the encoding stuff inlined, gives the
following. It would probably be rather close, speedwise, to a version of
parrot where all the string functions were put into the encodings vtable.
There'd be one method table lookup to get the string function, instead of
a simple static function call. But since the decode/skip logic could be
inlined into each specialized function, instead of performing multiple
indirections through the encodings vtable per character, there should
still be a definite speedup.
Total Count Function
18000000 Parrot_add_i_i_i
11260238 add_header_to_free
11260073 Parrot_allocate
11260071 new_string_header
11245053 Parrot_string_make
10130030 Parrot_string_length
10130003 Parrot_set_i_ic
10125000 Parrot_string_substr
10125000 Parrot_substr_s_s_i_ic
10125000 Parrot_string_compare
9000000 Parrot_ne_s_sc_ic
9000000 Parrot_mod_i_i_i
2189577 Parrot_branch_ic
2094880 buffer_lives
1490680 Parrot_inc_i
1125015 Parrot_string_concat
1125000 Parrot_lt_i_i_ic
1125000 Parrot_concat_s_sc
1125000 Parrot_eq_s_sc_ic
1079915 Parrot_eq_i_ic_ic
261862 mark_used
Time Function
1758 runops_fast_core
1204 Parrot_string_substr
735 Parrot_mod_i_i_i
706 Parrot_string_make
660 Parrot_add_i_i_i
560 new_string_header
558 Parrot_substr_s_s_i_ic
508 Parrot_string_compare
469 Parrot_ne_s_sc_ic
468 Parrot_allocate
444 add_header_to_free
379 mark_PMCs_unused
379 Parrot_string_concat
316 mark_buffers_unused
300 Parrot_set_i_ic
215 free_unused_buffers
Total Time Function
4985 Parrot_substr_s_s_i_ic
4426 Parrot_string_substr
3612 Parrot_string_make
2252 new_string_header
1691 alloc_more_string_headers
1673 Parrot_do_dod_run
1046 Parrot_concat_s_sc
949 Parrot_string_concat
932 Parrot_ne_s_sc_ic
735 Parrot_mod_i_i_i
660 Parrot_add_i_i_i
659 free_unused_buffers
657 Parrot_allocate
508 Parrot_string_compare
444 add_header_to_free
379 mark_PMCs_unused
316 mark_buffers_unused
300 Parrot_set_i_ic
188 Parrot_go_collect
Re-interpreting these results, the next place to fix up would be substr
itself. It is likely slow due to the various if statements in there for
protection and features (e.g., negative offsets). I'm sure Peter's COW
stuff would help eliminate some of the copying overhead within that
function, although I can't be sure.
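To illustrate the copying that COW would avoid: a substring can share
the parent's buffer and just adjust an offset/length window, making
substr O(1) with no allocation. This is a hypothetical struct for
illustration, not Parrot's actual STRING layout (a real system would
also need reference counting or GC awareness for the shared buffer):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical COW-ish string header: many headers may point into one
 * shared buffer. */
typedef struct {
    const char *buf;   /* shared; refcounted or GC-tracked for real */
    size_t offset;     /* window start within buf */
    size_t length;     /* window length */
} cow_string;

/* O(1) substr: no allocation, no byte copying.  The clamping mirrors
 * the kind of range protection Parrot_string_substr has to do. */
static cow_string cow_substr(const cow_string *s, size_t start, size_t len)
{
    cow_string sub;
    if (start > s->length)
        start = s->length;
    if (len > s->length - start)
        len = s->length - start;
    sub.buf = s->buf;
    sub.offset = s->offset + start;
    sub.length = len;
    return sub;
}
```

The copy only happens later, if and when someone writes through one of
the headers sharing the buffer.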
Beyond that, very simple functions like add_i_i_i, etc. eat a lot of time
in aggregate, which is something we can't do much about within the current
dispatch. The two solutions are computed goto (I don't know how to run
VTune with GCC, if that's even possible) and JIT, both of which eliminate
the call overhead for all these smallish functions. I could probably apply
the JIT patch which supports Win32 and reprofile it. I'm not sure if/how
well that'd work, tho.
Thoughts?
Mike Lambert