I thought people might be interested in this cross-language benchmark of a
realistic application:
https://github.com/chezou/TinySegmenter.jl/issues/8
TinySegmenter is an algorithm for breaking Japanese text into words, and it
has been ported by several authors to different programming languages.
Michiaki Ariga (@chezou) ported it to Julia, and after he and I optimized it
a bit, he ran some benchmarks comparing its performance to that of the other
TinySegmenter ports. The resulting times (in seconds) for different
languages were:
JavaScript   Python2   Python3   Julia   Ruby
121.04       92.85     29.6400   12.36   (933+)
The algorithm basically consists of looping over the characters in a
string, plugging tuples of consecutive characters into a dictionary of
"scores", and spitting out a word break when the score exceeds a threshold.
The biggest speedup in optimizing the Julia code came from using tuples
of Char (characters) as the dictionary keys rather than concatenating the
chars into strings. This exploits Julia's fast tuples and avoids the need to
create and then discard lots of temporary strings.
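
To make the loop described above concrete, here is a minimal sketch. It is in Python rather than Julia only so that it is easy to run, and it is not the TinySegmenter code: the score table, the threshold, and the names SCORES, THRESHOLD, and segment are all invented for illustration, and the real algorithm uses many more features (including character categories).

```python
# Toy score table keyed on tuples of consecutive characters.
# The entries and the threshold are invented for illustration only.
SCORES = {
    ("a", "b"): 5,
    ("b", "c"): -3,
    ("c", "d"): 7,
}
THRESHOLD = 0


def segment(text):
    """Split text wherever the score of a character pair exceeds THRESHOLD."""
    if not text:
        return []
    words = []
    start = 0
    for i in range(1, len(text)):
        # Look up the pair as a tuple; no concatenated string is built as the key.
        key = (text[i - 1], text[i])
        if SCORES.get(key, 0) > THRESHOLD:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


print(segment("abcd"))  # ['a', 'bc', 'd']
```

Note that the performance benefit of tuple keys reported above is specific to Julia; this sketch only shows the shape of the lookup.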
The Julia implementation is also different from the others in that it is
the only one that operates completely in-place on the text, without
allocating large temporary arrays of characters and character categories,
and returns SubStrings rather than copies of the words. This speeds things
up only slightly, but saves a lot of memory for a large text.
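
Python has no direct equivalent of Julia's SubString, but a rough analogue of the idea is to yield index ranges into the original text instead of copying each word. The sketch below assumes a hypothetical is_break(text, i) callback standing in for the scoring step; neither name comes from any of the ports.

```python
def word_spans(text, is_break):
    """Yield (start, stop) index pairs for each word, without copying text."""
    start = 0
    for i in range(1, len(text)):
        # is_break is a hypothetical callback: True if a word ends before index i.
        if is_break(text, i):
            yield (start, i)
            start = i
    if text:
        yield (start, len(text))


# Example: break after every space.
spans = list(word_spans("ab cd", lambda t, i: t[i - 1] == " "))
print(spans)  # [(0, 3), (3, 5)]
```

The caller can slice the original text with these ranges only when a word is actually needed.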
--SGJ
PS. Also, Julia's ability to explicitly type dictionaries caught a bug in
the original implementation, where the author had missed the fact that the
グ character is actually formed by two codepoints in Unicode.
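
To illustrate the codepoint issue, here is a small Python sketch. It assumes the character in question was stored in its decomposed form (ク followed by the combining voiced sound mark U+3099), which is my guess at the cause rather than something checked against the original source.

```python
import unicodedata

precomposed = "\u30b0"        # グ as a single codepoint
decomposed = "\u30af\u3099"   # ク followed by the combining voiced sound mark

print(len(precomposed))  # 1
print(len(decomposed))   # 2

# NFC normalization composes the two-codepoint form into the single codepoint.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A dictionary whose keys are declared to be single characters cannot hold the two-codepoint form, which is how a typed dictionary can surface this kind of mistake.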