Optimize parsing large file line-by-line

cblake Tue, 26 Apr 2022 15:00:21 -0700

@PMunch - my hunch about bioinformatics was right. The test file I was using is 
called "1kg_phase1_all.bim". I got mine out of 
[here](https://www.dropbox.com/s/k9ptc4kep9hmvz5/1kg_phase1_all.tar.gz?dl=1), 
but searching for "1kg_phase1_all" turns up a lot of hits, so it seems to be 
widely distributed. The data seems to be produced by a program called "plink", 
in a format meant to be easily "split parsed". The md5sum of the data I used is 
4087b5280f40a93025d5d26d777ce6e8, with 1228635806 bytes in 39728178 lines.
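If you want to check that your copy matches before comparing timings, something like the following would do. It is only a sketch: `md5_and_lines` and `chunk_size` are names made up for illustration, and it reads in chunks so the 1.2 GB file never has to fit in memory.

```python
import hashlib

def md5_and_lines(path, chunk_size=1 << 20):
    """Return (md5 hex digest, newline count, byte count), reading in chunks."""
    h = hashlib.md5()
    n_lines = 0
    n_bytes = 0
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            n_lines += chunk.count(b"\n")
            n_bytes += len(chunk)
    return h.hexdigest(), n_lines, n_bytes
```

On a matching copy, `md5_and_lines("1kg_phase1_all.bim")` should give back the digest, line count, and byte count quoted above.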


Using that, I rewrote the "likely Python being ported" as this:
    
    
    def parse_bim(bim, chrom):
      snp = []; a1 = []; a2 = []
      for line in open(bim):
        x = line.split('\t')
        if x[0] == chrom:
          snp.append(x[1]); a1.append(x[4]); a2.append(x[5])
      return (snp, a1, a2)
    assert len(parse_bim("file.bim", "1")[1]) == 3007196
    
    

That took about 10.5 seconds on my test machine.

Then I optimized the Python a little to:
    
    
    def rowsFor(bim, chrom):
      snp = []; a1 = []; a2 = []; n = 0
      found = False     # plink emits sorted by chromosome
      start = chrom + "\t"
      for line in open(bim):
        if line.startswith(start):
          found = True
          x = line.strip().split('\t')
          yield (x[1], x[4], x[5], n)
          n += 1
        elif found: break
    
    for (snp, a1, a2, n) in rowsFor("file.bim", "1"):
      if n == 3007195: print(snp, a1, a2) # print last
    
    

This ran in about 1.78s on Python 3.9.12 (and 1.57s on Python 2.7.18). { Yes, 
yes... I have heard for _many years_ how someday py3 would catch up to py2 in 
performance. It now seems that day may never come, but this is Nim Forum, so 
moving on... }

I then optimized my original Nim suggestion (tuned for a beginner) to this:
    
    
    import std/memfiles, system/ansi_c
    
    iterator rowsFor(bim, chrom: string):
        (MemSlice, MemSlice, MemSlice, int) =
      var result: (MemSlice, MemSlice, MemSlice, int)
      let chrom = chrom & "\t"
      var found = false     # plink emits sorted by chromosome
      let (p, n) = (cast[pointer](chrom[0].addr), chrom.len)
      var mf = memfiles.open(bim)
      var line: MemFile     # dummy obj to use memSlices on
      for ln in mf.memSlices:
        if ln.size > n and cmemcmp(ln.data, p, n.csize_t) == 0:
          found = true
          line.mem = ln.data; line.size = ln.size
          var i = 0         # only split needed columns
          for col in line.memSlices('\t'):
            if   i == 1: result[0] = col
            elif i == 4: result[1] = col
            elif i == 5: result[2] = col; break
            inc i
          yield result
          inc result[3]
        elif found: break   # all found in one block of lines
      mf.close
    
    for (snp, a1, a2, n) in "file.bim".rowsFor("1"):
      if n == 3_007_195: echo $snp," ",$a1," ",$a2 # print last
    
    

which runs in 0.124s (--mm:arc -d:danger) on the same machine - over 14X faster 
than the optimized Python on py3 (and about 12.7X faster than on py2). A next 
step might be parallel parsing, but this is already about 84X faster than the 
original Python.
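For reference, the ratios fall straight out of the wall-clock times quoted in this post. The helper name `speedups` is made up for illustration; the numbers are the ones reported above.

```python
def speedups(timings, baseline):
    """Map each label to (its time / baseline time)."""
    return {label: t / baseline for label, t in timings.items()}

# Wall-clock seconds as reported in this post
python_times = {
    "original Python": 10.5,
    "optimized, Python 3.9.12": 1.78,
    "optimized, Python 2.7.18": 1.57,
}
nim_time = 0.124

for label, ratio in speedups(python_times, nim_time).items():
    print(f"{label}: {ratio:.1f}x")
```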

I don't know if @LLN13 is still reading any of this, but how I usually like to 
put it is: Nim responds well to optimization effort. I would be curious to see 
@PMunch's npeg/parser generator ideas play out if this test data file inspires 
him. :-)
