Hi, I was wondering if there is a way to optimize this kind of data processing.
I'm using Windows 10, and I need to parse many relatively small .txt files and
rearrange their content into a table, where the table index is a pair of 2D grid
coordinates and the table value is a sequence of strings. The order of magnitude is
18_000 .txt files of variable size; on average each file is around 3000 lines.
Below you can find the code. I assume that accessing many small files on the
SSD is the main performance bottleneck, and I guess that a multi-threaded approach,
besides requiring some re-engineering (likely above my skills), wouldn't help
in this case, since performance is not CPU-bound. As a (much simpler)
exercise, I tried to see whether counting lines across all these .txt files
with a threadpool would speed things up, and it didn't. Is there some
not-too-complex optimization for cases like this, i.e. accessing many
small .txt files? Thank you in advance.
import std/[tables, strutils, os]

type GridKey = tuple[x, y: int]   # pair of 2D grid coordinates

var
  line, objMeas: string
  gridkey: GridKey
  lineSeq: seq[string]
  # key and value types spelled out explicitly
  gridTable = initTable[GridKey, seq[string]]()

for file in walkFiles("*.txt"):
  let f = open(file)
  # each obj file has several tab-separated lines like these:
  # grid coords (x, y), obj_id, measure
  # 526100 5043600 MX890M1E -110.58
  # 526150 5043600 MX890M1E -110.3
  # 526200 5043600 MX890M1E -110.19
  # 526250 5043600 MX890M1E -110.13
  # (...)
  while f.readLine(line):
    lineSeq = line.split('\t')
    gridkey = (x: parseInt(lineSeq[0]), y: parseInt(lineSeq[1]))
    objMeas = lineSeq[2] & "@" & lineSeq[3]
    # hasKeyOrPut stores @[objMeas] for a new key and returns false;
    # for an existing key it returns true, so append to the stored seq
    if gridTable.hasKeyOrPut(gridkey, @[objMeas]):
      gridTable[gridkey].add(objMeas)
  close(f)

let exportcsv = open("gridtable.csv", fmWrite)
for k in gridTable.keys:
  exportcsv.writeLine($k.x & ";" & $k.y & ";" & gridTable[k].join(","))
exportcsv.close()

# a gridtable entry appears like this: different obj measurements falling in
# the same grid tile are inserted in the seq associated with its grid index:
#
# 526100;5043600;MX428M3E@-93.56,MX890M1E@-110.58,MX890M2E@-87.88,MX890M3E@-104.71
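
The assumption that file access rather than parsing dominates the run time could be checked by timing the two phases separately. What follows is only a minimal sketch under stated assumptions, not a definitive benchmark: `GridKey` and `parseInto` are hypothetical names introduced here, each file is read whole with `readFile` instead of line by line, and the input is assumed to be tab-separated with four fields per line as in the code above.

```nim
import std/[os, strutils, tables, monotimes, times]

type GridKey = tuple[x, y: int]   # hypothetical name for the (x, y) grid index

proc parseInto(content: string, grid: var Table[GridKey, seq[string]]) =
  ## Parses lines of the form x<TAB>y<TAB>obj_id<TAB>measure and appends
  ## "obj_id@measure" to the seq stored under (x, y).
  for line in content.splitLines:
    if line.len == 0: continue        # skip blank lines, e.g. after a trailing newline
    let fields = line.split('\t')
    let key: GridKey = (x: parseInt(fields[0]), y: parseInt(fields[1]))
    grid.mgetOrPut(key, newSeq[string]()).add(fields[2] & "@" & fields[3])

when isMainModule:
  var
    grid = initTable[GridKey, seq[string]]()
    readTime, parseTime: Duration     # both start at zero
  for path in walkFiles("*.txt"):
    let t0 = getMonoTime()
    let content = readFile(path)      # whole file in one call
    let t1 = getMonoTime()
    parseInto(content, grid)
    let t2 = getMonoTime()
    readTime = readTime + (t1 - t0)
    parseTime = parseTime + (t2 - t1)
  echo "read:  ", readTime
  echo "parse: ", parseTime
  echo "keys:  ", grid.len
```

If most of the time turns out to be in `parse` rather than `read`, the I/O-bound assumption would not hold for this data set. The numbers are only meaningful when compiled with `-d:release`, and a second run may report a smaller read time because the operating system caches recently read files.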