Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

markebbert Sat, 17 Aug 2019 11:45:38 -0700

Hello,

I'm new to Nim, but was tempted to give it a go because I've heard it has the 
simplicity of Python and the speed of C. I sat down to write my first Nim 
script last week, where I mimicked a script I had already written in Python. 
I was excited to see just how fast Nim would be. The script's purpose was to 
traverse a .vcf (Variant Call Format) file that is used to store genetic 
mutations from DNA sequencing. This particular .vcf file had ~300k lines and 
~15k columns.


My original Python script took ~74min to parse the file, perform some simple 
logic, and print results to an output file. I was shocked that my Nim 
implementation took ~404min! I originally assumed I was abusing Nim in some 
way, but using the profiler, I was surprised to learn that 53% of the time was 
spent in strutils.split (I was splitting on tabs, '\t'). I then wrote a minimal 
example in Nim, where all I do is read through the file and split lines on 
tabs, just to make sure I wasn't missing something. It still took ~400 minutes. 
Python takes ~30min using the "same" minimal example. I then learned that Nim's 
split iterator might speed things up a bit, so I tried that. It definitely 
helped, but still took ~178min. A Groovy version took ~11 minutes.
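
To separate the cost of splitting from gzip decompression and file I/O, the split step can be timed on its own against a synthetic line. The following is a minimal sketch under stated assumptions, not the script behind the timings above: `make_line`, `genotypes`, and `time_split` are hypothetical names, and the row contents are made up to resemble a VCF data line with 9 fixed columns followed by sample columns.

```python
import time


def make_line(n_samples=15000):
    # Hypothetical synthetic VCF-like row: 9 fixed columns, then sample columns.
    fixed = ["1", "12345", ".", "A", "G", "50", "PASS", ".", "GT:DP:GQ"]
    samples = ["0/1:30:99"] * n_samples
    return "\t".join(fixed + samples)


def genotypes(line):
    # Split the row on tabs, then keep only the first ':'-separated
    # field (GT) of each sample column.
    toks = line.rstrip("\n").split("\t")
    return [sample.split(":", 1)[0] for sample in toks[9:]]


def time_split(n_lines=100, n_samples=15000):
    # Time only the splitting work; no file or gzip access is involved.
    line = make_line(n_samples)
    start = time.perf_counter()
    for _ in range(n_lines):
        genotypes(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"{time_split():.3f} s for 100 synthetic lines")
```

If this loop is fast while the full script is slow, the bottleneck is more likely in reading or decompressing the file than in the split itself.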

With all of the promise of Nim, I'm surprised that this basic function is so 
slow compared to Python. I'm mostly curious what's causing the difference. Can 
anyone explain? I've seen related posts here, but they were fairly old and 
didn't really delve into why.

I realize that Nim is still young, and I'm not trying to be critical, but I was 
surprised, given everything I've heard about Nim's speed. Am I missing 
something? Any upcoming updates that will address this?

Really appreciate any feedback.

Here are my three different versions:

**Nim (using split; ~400min)**
    
    
    import zip/gzipfiles
    import strutils
    
    block:
      let vcf = newGzFileStream("../my_file.vcf.gz")
      
      var
        line: string
        toks: seq[string]
        sample_toks: seq[string]
      
      while not vcf.atEnd():
        line = vcf.readLine()
        
        if not line.startsWith('#'):
          
          toks = line.strip().split('\t')
          
          # Only need sample columns
          for i, sample in toks[9..<toks.len]:
            sample_toks = sample.split(':')
    
    
    

**Nim (split iterator; ~178min)**
    
    
    import zip/gzipfiles
    import strutils
    
    block:
      let vcf = newGzFileStream("../my_file.vcf.gz")
      
      var
        line: string
        genotype: string
        i: int
      
      while not vcf.atEnd():
        line = vcf.readLine()
        
        if not line.startsWith('#'):
          
          # Restart the column counter for each data line
          i = 0
          for col in split(line, '\t'):
            
            # Only need sample columns
            if i >= 9:
              for fmt in split(col, ':'):
                genotype = fmt
                
                # Only need the first element (GT)
                break
            
            inc(i)
    
    

**Python (~30min)**
    
    
    import sys
    import gzip
    
    def main(vcf_path):
        # Open in text mode ('rt') so lines are str, not bytes, on Python 3
        with gzip.open(vcf_path, 'rt') as vcf:
            for line in vcf:
                if not line.startswith('#'):
                    
                    toks = line.strip().split('\t')
                    
                    # Only need sample columns
                    for index, sample in enumerate(toks[9:]):
                        sample_toks = sample.split(':')
    
    
    if __name__ == "__main__":
        main(sys.argv[1])
    
    

**Groovy (~11min)**
    
    
    import java.util.zip.GZIPInputStream
    import java.io.FileInputStream
    import java.util.ArrayList
    
    def gzInputStream = new GZIPInputStream(new FileInputStream("../my_file.vcf.gz"))
    
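    gzInputStream.withReader { Reader reader ->
        def geno
        reader.eachLine { String line ->
            if(!line.startsWith("#")){
                // Restart the column counter for each data line
                def i = 0
                // Split on tabs, matching the Nim and Python versions
                line.split('\t').each{ col ->
                    // Only need sample columns
                    if(i >= 9){
                        geno = col.split(':')
                    }
                    i++
                }
            }
        }
    }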
    
    
