Hello,
I'm new to Nim, but was tempted to give it a go because I've heard it has the
simplicity of Python and the speed of C. I sat down to write my first Nim
script last week, where I mimicked a script I had already written in Python.
I was excited to see just how fast Nim would be. The script's purpose was to
traverse a .vcf (Variant Call Format) file that is used to store genetic
mutations from DNA sequencing. This particular .vcf file had ~300k lines and
~15k columns.
My original Python script took ~74min to parse the file, perform some simple
logic, and print results to an output file. I was shocked that my Nim
implementation took ~404min! I originally assumed I was abusing Nim in some
way, but using the profiler, I was surprised to learn that 53% of the time was
spent in strutils.split (I was splitting on tabs; '\t'). I then wrote a minimal
example in Nim, where all I do is read through the file and split lines on
tabs, just to make sure I wasn't missing something. It still took ~400 minutes.
Python takes ~30min using the "same" minimal example. I then learned that Nim's
split iterator might speed things up a bit, so I tried that. It definitely
helped, but still took ~178min. A Groovy version took ~11 minutes.
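For concreteness, the per-line work in the minimal example can be sketched roughly as below. This is only an illustration in Python, assuming the usual VCF layout where the first nine tab-separated columns are fixed fields and everything after them is a sample column. The function name `sample_genotypes` and the example record are made up for this sketch.

```python
def sample_genotypes(line, first_sample_col=9):
    """Return the first ':'-separated field (GT) of every sample column."""
    cols = line.rstrip("\n").split("\t")
    # split(":", 1) stops after the first separator, so the remaining
    # sub-fields of each sample column are never broken apart
    return [col.split(":", 1)[0] for col in cols[first_sample_col:]]


# Hypothetical record: 9 fixed columns followed by two sample columns
example = "\t".join(
    ["1", "100", ".", "A", "G", "50", "PASS", ".", "GT:DP", "0/1:12", "1/1:30"]
)
print(sample_genotypes(example))
```

With ~15k columns per line, nearly all of the time goes into this splitting step, which is why the cost of split dominates the comparison.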
With all of the promise of Nim, I'm surprised that this basic function is so
slow compared to Python. I'm mostly curious what's causing the difference. Can
anyone explain? I've seen related posts here, but they were fairly old and
didn't really delve into why.
I realize that Nim is still young, and I'm not trying to be critical, but I was
surprised, given everything I've heard about Nim's speed. Am I missing
something? Any upcoming updates that will address this?
Really appreciate any feedback.
Here are my three different versions:
**Nim (using split; ~400min)**
import zip/gzipfiles
import strutils
block:
  let vcf = newGzFileStream("../my_file.vcf.gz")
  var
    line: string
    toks: seq[string]
    sample_toks: seq[string]
  while not vcf.atEnd():
    line = vcf.readLine()
    if not line.startsWith('#'):
      toks = line.strip().split('\t')
      # Only need sample columns
      for i,sample in toks[9..<toks.len]:
        sample_toks = sample.split(':')
**Nim (split iterator; ~178min)**
import zip/gzipfiles
import strutils
block:
  let vcf = newGzFileStream("../my_file.vcf.gz")
  var
    line: string
    genotype: string
    i: int
  while not vcf.atEnd():
    line = vcf.readLine()
    if not line.startsWith('#'):
      # Reset the column counter for each line
      i = 0
      for col in split(line, '\t'):
        # Only need sample columns
        if i >= 9:
          for fmt in split(col, ':'):
            genotype = fmt
            # Only need the first element (GT)
            break
        inc(i)
**Python (~30min)**
import sys
import gzip
def main(vcf):
    vcf = gzip.open(vcf, 'r')
    for line in vcf:
        if not line.startswith('#'):
            toks = line.strip().split('\t')
            for index,sample in enumerate(toks[9:]):
                sample_toks = sample.split(':')
if __name__ == "__main__":
    main(sys.argv[1])
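One caveat on the Python version: it assumes each line comes back as a str. On Python 3, `gzip.open(vcf, 'r')` yields bytes, and `line.startswith('#')` then raises a TypeError. A minimal sketch of the same loop for Python 3 follows, assuming text mode (`'rt'`) is acceptable. The name `count_sample_fields` and the tiny demo file are invented for illustration and are not part of the original script.

```python
import gzip
import os
import tempfile


def count_sample_fields(path):
    """Split every non-header line like the script above; count sample columns."""
    seen = 0
    # "rt" opens the gzip stream in text mode, so each line is a str
    with gzip.open(path, "rt") as vcf:
        for line in vcf:
            if not line.startswith("#"):
                toks = line.strip().split("\t")
                for sample in toks[9:]:
                    sample_toks = sample.split(":")
                    seen += 1
    return seen


# Tiny made-up file so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.vcf.gz")
    with gzip.open(demo, "wt") as out:
        out.write("#CHROM\tPOS\n")
        out.write("\t".join(["1", "100", ".", "A", "G", "50", "PASS", ".", "GT", "0/1", "1/1"]) + "\n")
    print(count_sample_fields(demo))
```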
**Groovy (~11min)**
import java.util.zip.GZIPInputStream
import java.io.FileInputStream
import java.util.ArrayList
def gzInputStream = new GZIPInputStream(new FileInputStream("../my_file.vcf.gz"))
gzInputStream.withReader { Reader reader ->
    def geno
    reader.eachLine { String line ->
        if (!line.startsWith("#")) {
            // Reset the column counter for each line
            def i = 0
            line.split('\t').each { col ->
                // Only need sample columns
                if (i >= 9) {
                    geno = col.split(':')
                }
                i++
            }
        }
    }
}