I agree with @siloamx. I would also point out that for many programmers less 
sophisticated about compilers/parsing, splitting is conceptually simple enough 
to make all the difference. Such programmers may never even have heard of 
"lexing". This is just to re-express @Araq's point about it being "naive" in 
maybe a slightly more accommodating way. They probably should learn, but maybe 
they want to focus on some simple problem. Also, data isn't always in a 
strict/pure table format amenable to `parsecsv`.
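
As a concrete (if naive) example, a whole "column extractor" is just a few 
lines of stdlib Nim; the tab separator and `myfile.tsv` here are only 
assumptions for illustration:
    
    
    import strutils
    
    for line in lines("myfile.tsv"):   # hypothetical input file
      let fields = line.split('\t')    # conceptually simple; no quoting/escaping rules
      echo fields[0]                   # e.g., emit just the first column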

I have some routines in `cligen/mslice` 
([https://github.com/c-blake/cligen](https://github.com/c-blake/cligen)) that 
may help with this. Some example usage is in `examples/cols.nim`. In the `mmap` 
mode of `cols`, I get ~50% of the run-time of GNU `awk` for field splitting, 
though I have admittedly never tried it with 15,000 columns.
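
If you want the same `mmap` flavor without pulling in `cligen`, the stdlib's 
`memfiles` at least gets you zero-copy line slices; this is only a sketch of 
the general idea (with a hypothetical file name), not the `mslice` API itself:
    
    
    import memfiles
    
    var mf = memfiles.open("myfile.tsv")  # mmap the whole file; no per-line copies
    var n = 0
    for line in memSlices(mf):            # MemSlice = pointer + length into the map
      inc n                               # field-split `line` here; `$line` copies to string
    mf.close
    echo n, " lines"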

One final, related point is that `gunzip` can be very slow at decompression and 
is single-threaded. If the average column width is ~20 bytes, then 300k rows * 
15k columns * 20 bytes =~ 90 GB. Working with the uncompressed file directly, 
or using something that can decompress in parallel like `Zstd` 
([https://en.wikipedia.org/wiki/Zstandard](https://en.wikipedia.org/wiki/Zstandard)),
may help **a lot**, especially if you have 4-12 idle cores as many do these 
days. On Linux, you should be able to just 
    
    
    import posix   # popen/pclose live here
    
    # decompress with 8 threads, streaming into this process
    let f = popen("pzstd -cdqp8 < myfile.zs".cstring, "r".cstring)
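
and then read it like any other `File`; a minimal sketch assuming the `f` from 
above (the final `pclose` reaps the `pzstd` child):
    
    
    for line in lines(f):   # stream decompressed lines from the pipe
      discard line          # ... split/process each line here
    discard pclose(f)       # let the child exit cleanly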

(You may have to convert, just once, via `zcat file.gz | pzstd -f -p8 -19 > 
file.zs`, and, of course, this will not help if you have a pile of `.gz` files 
each processed only once.)
