fetching data from many small .txt files

cblake Sun, 30 Apr 2023 11:00:18 -0700

I suspect there are system settings to optimize small-file IO on Windows 10, 
but I am not the person to ask, and that is not really Nim-specific. I will 
observe that 3,000 lines of 40-ish bytes each is about 120 KB per file, or 
roughly 30 virtual memory pages (at 4 KiB each), which may not be what everyone 
considers "small". All together, 18_000*3_000*40 = 2.16e9 bytes, which on a 
modern NVMe SSD should take only about 1 second of actual device IO. (I have 
one that can do that in about 250 milliseconds.)
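For anyone who wants to redo that estimate with their own numbers, here is a small sketch of the arithmetic. The 4 KiB page size is an assumption (the usual default on x86-64), and the throughput figures are simply what the 1 second and 250 millisecond timings imply, not measurements.

```nim
const
  files        = 18_000      # number of .txt files
  linesPerFile = 3_000
  bytesPerLine = 40          # "40-ish" bytes per line
  pageSize     = 4096        # assumed virtual memory page size in bytes

  bytesPerFile = linesPerFile * bytesPerLine
  pagesPerFile = (bytesPerFile + pageSize - 1) div pageSize  # rounded up
  totalBytes   = files * bytesPerFile

echo "bytes per file: ", bytesPerFile
echo "pages per file: ", pagesPerFile
echo "total bytes:    ", totalBytes

# Sequential read rate implied by each timing, in GB/s.
for seconds in [1.0, 0.25]:
  echo seconds, " s needs about ", totalBytes.float / seconds / 1e9, " GB/s"
```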


You almost surely have much more than 2GB RAM on your computer. You may be able 
to use a [RAM 
disk](https://github.com/nim-lang/RFCs/issues/503#issuecomment-1367542495) and 
just copy all the files into that (R: or T: or whatever). If you run the Nim 
code against files there, then the time should be more CPU bound.

One way to use less CPU time within a stdlib setting would be to do less string 
creation/destruction in both your parsing and printing phases, as in:
    
    
    import std/[tables, os, strutils]
    var
      emptySeq = @[""]                  # initial value for a new (x, y) entry
      objMeas: string                   # reused buffer: "<field 2>@<field 3>"
      gridKey: tuple[x: int, y: int]
      gridTable = initTable[gridKey, emptySeq]()
      x, y: int
    
    for file in walkFiles("*.txt"):
      for line in file.lines:
        var i = 0
        for field in line.split('\t'):  # iterator form: no seq of fields is built
          if   i == 0: x = parseInt(field)
          elif i == 1: y = parseInt(field)
          elif i == 2: objMeas.setLen 0; objMeas.add field
          elif i == 3: objMeas.add  "@"; objMeas.add field
          inc i
        gridTable.mgetOrPut((x, y), emptySeq).add objMeas
    
    let exportcsv = open("gridtable.csv", fmWrite)
    for k, v in gridTable:
      exportcsv.write k.x, ";", k.y
      for i, objMeas in v:
        exportcsv.write if i == 0: ';' else: ','
        exportcsv.write objMeas
      exportcsv.write '\n'              # end the row for this (x, y) key
    exportcsv.close()
    
    

Another way is to use `std/memfiles`. There is an example of that [in this 
thread](https://forum.nim-lang.org/t/9688).
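As a rough idea of what the `std/memfiles` route can look like (this is my own minimal sketch, not the code from that thread), the proc below memory-maps one file and refills a single line buffer instead of allocating a string per line. The names `GridKey` and `addFile` are made up for illustration, and it assumes every line has exactly four tab-separated fields, so the tab between the last two fields simply becomes the `@`. Note that mapping an empty file may raise an `OSError` on some systems.

```nim
import std/[memfiles, os, parseutils, tables]

type GridKey = tuple[x: int, y: int]    # hypothetical name for the key type

proc addFile(grid: var Table[GridKey, seq[string]]; path: string) =
  ## Memory-map `path` and append one "field2@field3" string per line.
  var mf = memfiles.open(path)          # read-only mapping by default
  defer: mf.close()
  var line, objMeas: string             # buffers reused for every line
  for _ in lines(mf, line):             # refills `line` in place
    var x, y: int
    var pos = parseInt(line, x, 0)      # chars consumed; 0 means "no number"
    if pos == 0 or pos >= line.len: continue
    inc pos                             # skip the tab after x
    let n = parseInt(line, y, pos)
    if n == 0: continue                 # malformed line: skip it
    pos += n + 1                        # skip y and the tab after it
    objMeas.setLen 0
    for i in pos ..< line.len:          # copy the rest, turning tabs into '@'
      objMeas.add(if line[i] == '\t': '@' else: line[i])
    grid.mgetOrPut((x, y), @[]).add objMeas

when isMainModule:
  # Made-up demo data so the sketch runs on its own.
  let path = getTempDir() / "memfiles_demo.txt"
  writeFile(path, "1\t2\ta\tb\n1\t2\tc\td\n3\t4\te\tf\n")
  var grid: Table[GridKey, seq[string]]
  grid.addFile(path)
  removeFile(path)
  echo grid
```

In the real program you would call `addFile` once per path from `walkFiles("*.txt")` and then write the table out as before.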

If you can go out to Nimble packages, then with 
[cligen](https://github.com/c-blake/cligen) utility code you may get better 
performance out of something like this:
    
    
    import std/[tables, os], cligen/[mfile, mslice]
    var
      emptySeq = @[""]                  # initial value for a new (x, y) entry
      objMeas: string                   # reused buffer: "<field 2>@<field 3>"
      gridKey: tuple[x: int, y: int]
      lineSeq: seq[MSlice]              # reused across lines by msplit
      gridTable = initTable[gridKey, emptySeq]()
    
    for file in walkFiles("*.txt"):
      for ms in mSlices(file):          # one slice per line of the file
        discard ms.msplit(lineSeq, '\t', 0)
        gridKey = (x: parseInt(lineSeq[0]),
                   y: parseInt(lineSeq[1]))
        objMeas.setLen 0; objMeas.add lineSeq[2]
        objMeas.add  "@"; objMeas.add lineSeq[3]
        gridTable.mgetOrPut(gridKey, emptySeq).add objMeas
    
    let exportcsv = open("gridtable.csv", fmWrite)
    for k, v in gridTable:
      exportcsv.write k.x, ";", k.y
      for i, objMeas in v:
        exportcsv.write if i == 0: ';' else: ','
        exportcsv.write objMeas
      exportcsv.write '\n'              # end the row for this (x, y) key
    exportcsv.close()
    
    
