First, always best to have reproducible test data! Great initiative, @Zoom!

@tcheran did not specify whether the grid was fixed across samples or varying. Either 
could make sense (e.g. with wandering sensors that use the GPS satellite 
network to self-locate), but very different perf numbers & optimization ideas 
arise in the two situations (a small hash table with long lists vs. Zoom's giant 
hash table with short lists). For example, this generator program is similar, but 
uses a fixed grid:
    
    
    import std/[os, random, strutils, strformat], cligen/osUt
    const NLines = 3000                   # grid points (= lines per file)
    var rng = initRand(0xDEADBEEF'i64)    # fixed seed => reproducible data
    
    if paramCount() != 2: quit "Usage: start end", 1
    let a = parseInt(paramStr(1))         # first file number
    let b = parseInt(paramStr(2))         # last file number
    
    var grid: seq[(uint64, uint64)]       # one grid, shared by every file
    for _ in 1..NLines:
      let x = rng.next mod 1_000_000
      let y = rng.next mod 1_000_000
      grid.add (x, y)
    
    for fNum in a..b:
      let f = open(&"{fNum:05}.txt", fmWrite)  # raises IOError on failure
      for (x, y) in grid:
        let c = rng.rand(40000) - 20000   # value in cents: -20000..20000
        let d = c div 100                 # whole part (truncates toward 0)
        let e = abs(c) mod 100            # 2-digit cents (sign lost for -0.xx)
        # urite = unlocked write from cligen/osUt
        f.urite x, '\t', y, "\tMX890M1E\t", d, '.', &"{e:02}\n"
      f.close
    
    

I ran the above with "coarse-grained parallelism" (usually fine), i.e.:
    
    
    zoomDat 1 4500& zoomDat 4501 9000& zoomDat 9001 13500& zoomDat 13501 18000&
    
    

My prior programs had two bugs. First, to match results, `emptySeq` should be 
declared simply as `emptySeq: seq[string]`. Second, there needs to be a 
`write "\n"` after the field loop in the CSV output part. Oops.
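
For concreteness, here is roughly how the two fixes slot in; everything except 
the declaration & the trailing `write "\n"` is a made-up stand-in for the real 
program:
    
    # Fix 1: a plain declaration, no initializer, to match results.
    var emptySeq: seq[string]
    doAssert emptySeq.len == 0
    
    # Fix 2: a newline AFTER the per-field loop in the CSV output part.
    let row = @["123456", "654321", "-1.50"]    # stand-in for one output record
    let f = stdout                              # stand-in for the real CSV file
    for field in row:
      f.write field, '\t'
    f.write "\n"                                # the missing post-loop write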

I haven't compared RAM disks on Windows (someone should post more details on 
that), but on Linux `/dev/shm`, on a box with an i7-6700k at 4.8 GHz and 65 ns 
latency / 40 GB/s DIMMs, I get these runtimes (in seconds; large enough & well 
enough separated not to worry about measurement error):

Program | RanGrid | FixedGrid | TinyGrid
--------|---------|-----------|---------
Orig    | 48      | 40        | 27
cb1     | 36      | 30        | 19
cb2     | 25      | 20        | 8

That last TinyGrid column uses only 4 distinct grid points (from changing `x` 
& `y` in the output line to `a` & `b` - an early accidental bug). So, across 
columns we mostly see the effect of `seq` being faster than `Table`.
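
To make that `seq` vs. `Table` point concrete, a tiny illustrative sketch 
(names are mine): with only a handful of distinct grid points, a linear scan of 
a small `seq` stays cache-hot & branch-predictable, while every `Table` hit 
still pays hashing + probing:
    
    import std/tables
    
    type Cell = tuple[x, y: uint64]
    let keys: seq[Cell] = @[(1'u64, 2'u64), (3'u64, 4'u64),
                            (5'u64, 6'u64), (7'u64, 8'u64)]
    
    # Table: O(1) expected, but each lookup hashes & probes.
    var byHash: Table[Cell, int]
    for i, k in keys: byHash[k] = i
    
    # seq: O(n) scan; for ~4 keys the whole thing fits in a cache line or two.
    proc scan(s: seq[Cell], key: Cell): int =
      for i, k in s:
        if k == key: return i
      -1
    
    doAssert scan(keys, (5'u64, 6'u64)) == byHash[(5'u64, 6'u64)]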

One can maybe get a decent speed-up by going parallel & merging preliminary 
gridtable.csv's - how well that pays off depends on which grid-value diversity 
mode obtains (a minimal merge sketch follows below).
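
Something like this, assuming each worker wrote a partial `gridtable.*.csv` of 
`x,y,total` rows - the file-name pattern & row layout here are my assumptions, 
not an established format:
    
    import std/[os, tables, strutils]
    
    # Merge per-worker partial grid tables into one final table.
    # Adjust the parsing to whatever gridtable.csv actually contains.
    var merged: Table[(string, string), float]
    for path in walkFiles("gridtable.*.csv"):   # hypothetical partial-file pattern
      for line in lines(path):
        let cols = line.split(',')
        if cols.len < 3: continue               # skip headers/blank lines
        merged.mgetOrPut((cols[0], cols[1]), 0.0) += parseFloat(cols[2])
    
    for key, total in merged.pairs:             # emit the merged gridtable.csv
      echo key[0], ',', key[1], ',', total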

Unless/until such a parallel scale-up, this should not be an IO-bound problem on 
an SSD. Even a SATA SSD can probably do 750 MB/s, and this problem is only 1365 
MB, or maybe 2 seconds, while processing takes much more. With the above 
generated data, for example, `cat *.txt >/dev/null` takes only 0.22 seconds. 
So, at a minimum, one would need something like 20/0.22 ≈ 90 cores, without 
contention, for IO time to equal CPU time. @tcheran's "2nd run times" almost 
surely have the data cached in DIMMs anyway.
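
For concreteness, that break-even arithmetic as a snippet (the numbers are the 
measurements above):
    
    # Back-of-envelope: IO only becomes the bottleneck once CPU work
    # spread over N cores drops to the IO streaming time.
    let ioSec  = 0.22        # `cat *.txt >/dev/null` wall time
    let cpuSec = 20.0        # rough single-core processing time
    echo cpuSec / ioSec      # ~90 cores to break even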
