[julia-users] TinySegmenter benchmark

Steven G. Johnson Tue, 20 Oct 2015 10:50:26 -0700

I thought people might be interested in this cross-language benchmark of a 
realistic application:


     https://github.com/chezou/TinySegmenter.jl/issues/8

TinySegmenter is an algorithm for breaking Japanese text into words, and it 
has been ported by several authors to different programming languages. 
 Michiaki Ariga (@chezou) ported it to Julia, and after optimizing it a bit 
with me he ran some benchmarks comparing the performance to the different 
TinySegmenter ports.  The resulting times (in seconds) for different 
languages were:

JavaScript   Python2   Python3   Julia   Ruby
121.04       92.85     29.6400   12.36   (933+)

The algorithm basically consists of looping over the characters in a 
string, plugging tuples of consecutive characters into a dictionary of 
"scores", and spitting out a word break when the score exceeds a threshold. 
  The biggest speedup in optimizing the Julia code came from using tuples 
of Char (characters) rather than concatenating the chars into strings 
(which avoids the need to create and then discard lots of temporary strings 
by exploiting Julia's fast tuples).
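
To make the tuple-keyed lookup concrete, here is a minimal sketch of such a loop. It is not the TinySegmenter.jl code: the names SCORES, THRESHOLD, and breaks are hypothetical, the score values are made up, and only pairs of consecutive characters are scored (the real model is more elaborate). The point is just that the dictionary key is a tuple (prev, c) of Chars, so no temporary string such as string(prev, c) is built per lookup.

```julia
# Toy score table keyed by tuples of Char. The values are invented for
# illustration and have nothing to do with TinySegmenter's real model.
const SCORES = Dict{Tuple{Char,Char},Int}(
    ('a', 'b') => 5,
    ('b', 'c') => -3,
)
const THRESHOLD = 0

# Return the character positions after which a word break is emitted.
function breaks(text::AbstractString)
    out = Int[]
    prev = '\0'
    for (i, c) in enumerate(text)
        if i > 1
            # Key is a tuple of Chars: no temporary string is allocated.
            score = get(SCORES, (prev, c), 0)
            score > THRESHOLD && push!(out, i - 1)
        end
        prev = c
    end
    return out
end

breaks("abc")   # break after the first character only
```

Pairs that are missing from the table score 0 here, which is an assumption of this sketch rather than something stated about the real algorithm.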

The Julia implementation is also different from the others in that it is 
the only one that operates completely in-place on the text, without 
allocating large temporary arrays of characters and character categories, 
and returns SubStrings rather than copies of the words.  This sped things 
up only slightly, but saves a lot of memory for a large text.
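
As a rough illustration of returning SubStrings instead of copies, the sketch below cuts a string at given break positions. The function name split_at and the idea of passing precomputed break indices are assumptions made for this example; the package computes the breaks itself while scanning the text.

```julia
# Cut `text` into words that are views (SubStrings) into the original string.
# `ends` holds the byte index of the last character of each word.
function split_at(text::String, ends::Vector{Int})
    words = SubString{String}[]
    start = firstindex(text)
    for stop in ends
        # SubString refers to the parent string's storage; nothing is copied.
        push!(words, SubString(text, start, stop))
        start = nextind(text, stop)
    end
    return words
end

text = "日本語です"
# Each of these characters takes 3 bytes in UTF-8, so the characters start
# at byte indices 1, 4, 7, 10, and 13.
words = split_at(text, [7, 13])
```

Each element of words compares equal to an ordinary string ("日本語" and "です" here), but it only stores an offset and a length into text, which is where the memory saving for a large input comes from.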

--SGJ

PS. Also, Julia's ability to explicitly type dictionaries caught a bug in 
the original implementation, where the author had missed the fact that the 
ｸﾞ character is actually formed by two codepoints in Unicode.
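
The post does not say exactly how the typed dictionary exposed the problem, so the following is only a guess at the mechanism, with a hypothetical helper try_insert!. A dictionary declared with Char keys should refuse a key that is really a two-codepoint string:

```julia
# Halfwidth katakana KU (U+FF78) followed by the halfwidth voiced sound
# mark (U+FF9E). It displays as one glyph but is two codepoints.
ku = "\uff78\uff9e"
n_codepoints = length(ku)

table = Dict{Char,Int}()

# Hypothetical helper: report whether the key could be stored.
function try_insert!(d::Dict{Char,Int}, key, value)
    try
        d[key] = value
        return true
    catch
        # A key that cannot be converted to Char is rejected with an error.
        return false
    end
end

ok_single = try_insert!(table, '\uff78', 1)   # a genuine Char key is accepted
ok_double = try_insert!(table, ku, 1)         # the two-codepoint string is not
```

With an untyped dictionary, both keys would be stored without complaint and the mismatch would only show up as lookups that never hit.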
