First off, compiling with the command-line option `-d:release` always speeds up
Nim code. Even without release mode, though, Nim should still be faster than
Python, so I blame the nre module. Beyond that, here are some things I noticed
in your code.
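For reference, this is a sketch of how you'd compile with optimizations enabled; `main.nim` here is a placeholder for your actual source file:

```shell
# "main.nim" is a placeholder for your source file.
# -d:release enables C compiler optimizations and disables debug overhead:
nim c -d:release main.nim
```

Without this flag, Nim compiles in debug mode, which keeps runtime checks and skips optimization, so benchmark numbers from a plain `nim c` build are not representative.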
Let's go through the `cut` iterator that your code uses.
iterator cut*(sentence: string): string =
  let blocks: seq[string] = filter(nre.split(sentence, re_han),
                                   proc(x: string): bool = x.len > 0)
  var
    tmp = newSeq[string]()
    wordStr: string
  for blk in blocks:
    if isSome(blk.match(re_han)) == true:
      for word in internal_cut(blk):
        wordStr = $word
        if (wordStr in Force_Split_Words == false):
          yield wordStr
        else:
          for c in wordStr:
            yield $c
    else:
      tmp = filter(split(blk, re_skip),
                   proc(x: string): bool = x.len > 0 or x.runeLen() > 0)
      for x in tmp:
        yield x
You call filter and then immediately iterate over the result, twice here.
Converting iterators to seqs is fairly expensive, so it's best to do everything
in a single pass.
iterator cut*(sentence: string): string =
  for blk in sentence.split(re_han):
    if blk.len == 0: continue
    if blk.match(re_han).isSome:
      for word in internal_cut(blk):
        let wordStr = $word
        if wordStr notin Force_Split_Words:
          yield wordStr
        else:
          for c in wordStr:
            yield $c
    else:
      for x in blk.split(re_skip):
        if x.len > 0 or x.runeLen > 0:
          yield x
This doesn't really improve performance, but I thought I'd include it anyway:
proc lcut*(sentence: string): seq[string] =
  result = lc[y | (y <- cut(sentence)), string]
There is already a template for this purpose in system.nim (the module imported
by default) named
[accumulateResult](https://nim-lang.org/docs/system.html#accumulateResult.t,untyped).
It's used like so:
proc lcut*(sentence: string): seq[string] =
  accumulateResult(cut(sentence))
However, accumulateResult is deprecated on the devel branch. Luckily, you can
use [sequtils.toSeq](https://nim-lang.org/docs/sequtils.html#toSeq.t,untyped)
at your specific call site instead:
for line in lines:
  discard lcut(line).join("/")
becomes:
# top of file
from sequtils import toSeq

for line in lines:
  discard toSeq(cut(line)).join("/")
This probably doesn't have much to do with the slowness, but you can optimize
Table objects with char keys. Tables are currently implemented as a seq of
`tuple[hash, key, value]`, and since a char key is effectively its own hash,
storing the hash wastes 8 bytes of memory per entry. This might be optimized in
a future version of Nim, but for now this works:
proc getFromCharTable[V](charTable: openarray[(char, V)], key: char): V =
  # A linear scan is fine for the handful of entries a char table holds.
  for it in charTable:
    if it[0] == key:
      return it[1]

let foo = {'A': 1, 'B': 2}  # the type is an array of (char, int)
echo foo.getFromCharTable('B')  # 2