Hi, I was wondering whether there is a way to optimize this kind of data processing. 
I'm on Windows 10, and I need to parse many relatively small .txt files and 
rearrange their content into a table whose key is a pair of 2D grid 
coordinates and whose value is a sequence of strings. The order of magnitude is 
18_000 .txt files of variable size, averaging around 3000 lines each. 
The code is below. I assume that accessing many small files on the 
SSD is the main performance bottleneck, and I guess that multithreading, 
besides requiring some re-engineering (likely above my skills), wouldn't help 
in this case, since the workload is not CPU-bound. As a (much simpler) 
exercise, I tried counting lines across all these .txt files with a 
threadpool to see whether that would be faster... and it wasn't. Is there some 
not-too-complex optimization for cases like this, i.e. accessing many 
small .txt files? Thank you in advance.
    
    
    import std/[tables, strutils, os]
    var
        line, objMeas: string
        gridkey: tuple[x: int, y: int]
        lineSeq:  seq[string]
        # maps a grid tile (x, y) to the measurements that fall in it
        gridTable = initTable[tuple[x: int, y: int], seq[string]]()
    
    for file in walkFiles("*.txt"):
        let f = open(file)
        # each obj file has several lines like these:
        # grid coords (x, y), obj_id, measure
        # 526100        5043600 MX890M1E        -110.58
        # 526150        5043600 MX890M1E        -110.3
        # 526200        5043600 MX890M1E        -110.19
        # 526250        5043600 MX890M1E        -110.13
        # (...)
        while f.readLine(line):
            lineSeq = line.split('\t')
            gridkey = (x: parseInt(lineSeq[0]), y: parseInt(lineSeq[1]))
            objMeas = lineSeq[2] & "@" & lineSeq[3]
            if gridTable.hasKeyOrPut(gridkey, @[objMeas]):
                gridTable[gridkey].add(objMeas)
        close(f)
    let exportcsv = open("gridtable.csv", fmWrite)
    for k in gridTable.keys:
        exportcsv.writeLine($k.x & ";" & $k.y & ";" & gridTable[k].join(","))
    exportcsv.close()
    # A gridTable entry looks like this: different obj measurements falling in
    # the same grid tile are appended to the seq associated with its grid index:
    # 526100;5043600;MX428M3E@-93.56,MX890M1E@-110.58,MX890M2E@-87.88,MX890M3E@-104.71
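
One way to check the assumption that file access, rather than parsing, dominates would be to time the two phases separately. The following is only a minimal sketch under stated assumptions: `timePhases` is a hypothetical helper name, it reads each file whole with `readFile` instead of line by line as in the loop above, and the operating system's file cache can affect the read times, so the numbers are indicative at best.

```nim
import std/[os, strutils, monotimes, times]

# Hypothetical helper (not part of the code above): accumulates the time
# spent reading files and the time spent splitting their lines, separately.
proc timePhases(pattern: string): tuple[read, parse: Duration, fields: int] =
  for path in walkFiles(pattern):
    let t0 = getMonoTime()
    let content = readFile(path)      # whole file in one call
    let t1 = getMonoTime()
    for line in content.splitLines():
      if line.len > 0:
        # same tab split as in the loop above; count fields so the
        # work is not optimized away
        result.fields += line.split('\t').len
    let t2 = getMonoTime()
    result.read = result.read + (t1 - t0)
    result.parse = result.parse + (t2 - t1)

when isMainModule:
  let (readTime, parseTime, fieldCount) = timePhases("*.txt")
  echo "read:   ", readTime
  echo "parse:  ", parseTime
  echo "fields: ", fieldCount
```

If the parse time turned out to be comparable to or larger than the read time, the processing would probably not be purely I/O-bound, and the string handling would be worth a look as well.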
    
    
