@PMunch - my hunch about bioinformatics was right. A test file I was using is called "1kg_phase1_all.bim". I got mine out of [here](https://www.dropbox.com/s/k9ptc4kep9hmvz5/1kg_phase1_all.tar.gz?dl=1), but searching for "1kg_phase1_all" turns up a lot of hits, so the file seems to be widely distributed. The data seems to be produced by a program called "plink", specifically in a form that is easy to split and parse. The md5sum of the data I used is 4087b5280f40a93025d5d26d777ce6e8, with 1228635806 bytes in 39728178 lines.
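For anyone without the real file handy, a tiny synthetic stand-in is easy to make. This is a sketch under the assumption that a plink .bim file is a tab-separated table of six columns (chromosome, variant id, genetic distance, bp position, allele 1, allele 2), sorted by chromosome; the file name and rows here are made up:

```python
# Sketch: build a tiny synthetic .bim-style file (assumed layout:
# chromosome, variant id, genetic distance, bp position, allele1, allele2),
# tab-separated and sorted by chromosome, as plink emits it.
import os, tempfile

rows = [
    ("1", "rs1", "0", "100", "A", "G"),
    ("1", "rs2", "0", "200", "C", "T"),
    ("2", "rs3", "0", "150", "G", "A"),
]
path = os.path.join(tempfile.mkdtemp(), "tiny.bim")
with open(path, "w") as f:
    for r in rows:
        f.write("\t".join(r) + "\n")   # one tab-separated record per line

print(path)
```

The parsers below only rely on the tab separators and the chromosome-sorted order, so a toy file like this is enough to exercise them.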
Using that, I re-did the "likely Python being ported" to this:

```python
def parse_bim(bim, chrom):
    snp = []; a1 = []; a2 = []
    for line in open(bim):
        x = line.split('\t')
        if x[0] == chrom:
            snp.append(x[1]); a1.append(x[4]); a2.append(x[5])
    return (snp, a1, a2)

assert len(parse_bim("file.bim", "1")[1]) == 3007196
```

That took about 10.5 seconds on my test machine. Then I optimized the Python a little to:

```python
def rowsFor(bim, chrom):
    snp = []; a1 = []; a2 = []; n = 0
    found = False               # plink emits sorted by chromosome
    start = chrom + "\t"
    for line in open(bim):
        if line.startswith(start):
            found = True
            x = line.strip().split('\t')
            yield (x[1], x[4], x[5], n)
            n += 1
        elif found:
            break

for (snp, a1, a2, n) in rowsFor("file.bim", "1"):
    if n == 3007195:
        print(snp, a1, a2)      # print last
```

This ran in about 1.78 s on Python 3.9.12 (and 1.57 s on Python 2.7.18). { Yes, yes.. I have heard for _many years_ how someday py3 would catch up to py2 in performance. It now seems that day may never come, but this is Nim Forum, so moving on...
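The early-break trick above is only correct because plink emits rows sorted by chromosome: once the matching block ends, no more matches can follow. A minimal self-contained check (on hypothetical in-memory lines, not the real file) that the generator returns exactly what a full scan would:

```python
# Sketch: verify the early-break generator matches a full scan,
# assuming input lines are sorted by chromosome (as plink emits them).
lines = [
    "1\trs1\t0\t100\tA\tG\n",
    "1\trs2\t0\t200\tC\tT\n",
    "2\trs3\t0\t150\tG\tA\n",
]

def full_parse(lines, chrom):
    out = []
    for line in lines:                      # scans every line
        x = line.strip().split('\t')
        if x[0] == chrom:
            out.append((x[1], x[4], x[5]))
    return out

def rows_for(lines, chrom):
    found = False
    start = chrom + "\t"
    for line in lines:
        if line.startswith(start):          # cheap prefix test, no split yet
            found = True
            x = line.strip().split('\t')
            yield (x[1], x[4], x[5])
        elif found:
            break                           # past the sorted block; stop early

assert list(rows_for(lines, "1")) == full_parse(lines, "1")
print(list(rows_for(lines, "1")))           # → [('rs1', 'A', 'G'), ('rs2', 'C', 'T')]
```

On the real 39-million-line file the early break skips the tail of the file entirely, which is part of where the speedup comes from.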
} I then optimized my original Nim suggestion (tuned for a beginner) to this:

```nim
import std/memfiles, system/ansi_c

iterator rowsFor(bim, chrom: string): (MemSlice, MemSlice, MemSlice, int) =
  var result: (MemSlice, MemSlice, MemSlice, int)
  let chrom = chrom & "\t"
  var found = false             # plink emits sorted by chromosome
  let (p, n) = (cast[pointer](chrom[0].addr), chrom.len)
  var mf = memfiles.open(bim)
  var line: MemFile             # dummy obj to use memSlices on
  for ln in mf.memSlices:
    if ln.size > n and cmemcmp(ln.data, p, n.csize_t) == 0:
      found = true
      line.mem = ln.data; line.size = ln.size
      var i = 0                 # only split needed columns
      for col in line.memSlices('\t'):
        if i == 1: result[0] = col
        elif i == 4: result[1] = col
        elif i == 5: result[2] = col; break
        inc i
      yield result
      inc result[3]
    elif found:
      break                     # all found in one block of lines
  mf.close

for (snp, a1, a2, n) in "file.bim".rowsFor("1"):
  if n == 3_007_195:
    echo $snp, " ", $a1, " ", $a2   # print last
```

which runs in 0.124 s (--mm:arc -d:danger) on the same machine - over 14x faster than the fastest Python. A next step might be parallel parsing, but this is already about 84x faster than the original Python. I don't know if @LLN13 is still reading any of this, but how I usually like to put it is: Nim responds well to optimization effort. I would be curious to see @PMunch's npeg/parser generator ideas play out if this test data file inspires him. :-)
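For readers who want the flavor of the memfiles approach without Nim installed, the same idea (memory-map the file, do a cheap prefix comparison before splitting, stop at the end of the sorted block) can be sketched in Python with `mmap`. This is a rough analogue, not the author's code, and the file name and chromosome below are placeholders on synthetic data:

```python
# Sketch: memory-mapped scan with a cheap prefix test before splitting,
# roughly analogous to the Nim memfiles/memSlices version above.
import mmap, os, tempfile

def rows_for_mmap(path, chrom):
    start = (chrom + "\t").encode()
    found = False
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for line in iter(mm.readline, b""):
            if line.startswith(start):          # prefix match, like cmemcmp
                found = True
                x = line.rstrip(b"\n").split(b"\t")
                yield (x[1], x[4], x[5])        # snp, a1, a2
            elif found:
                break                           # sorted input: block is done

# Tiny demo on synthetic data (a real run would use the actual .bim file).
path = os.path.join(tempfile.mkdtemp(), "tiny.bim")
with open(path, "w") as f:
    f.write("1\trs1\t0\t100\tA\tG\n2\trs2\t0\t150\tG\tA\n")
print(list(rows_for_mmap(path, "1")))           # → [(b'rs1', b'A', b'G')]
```

It won't match the Nim version's speed (it still materializes a bytes object per line), but it shows the same access pattern: no per-line allocation of column lists until the prefix test passes.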