You're welcome. It's mostly an adaptation of 
[cligen/examples/linect.nim](https://github.com/c-blake/cligen/blob/master/examples/linect.nim)
 or 
[adix/tests/wf.nim](https://github.com/c-blake/adix/blob/master/tests/wf.nim). 
Relative to your book's "read whole file, then chunk" idea, my `nSplit` does 
one less "pass over the data" by leveraging mmap/random access to the file. 
At a high level, both approaches rely on the statistical regularity of record 
sizes to avoid the switching overhead of more fine-grained parallel dispatch.
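
For concreteness, here is a minimal sketch of that mmap-splitting idea (plain
Nim, not cligen's actual `nSplit`; the proc name and file path are made up for
illustration, and it assumes '\n'-terminated records of roughly regular size):
pick n roughly equal byte offsets and nudge each forward to the next newline,
so no full read pass is needed before handing chunks to workers.

```nim
import std/memfiles

# Split an mmap'd file into at most `n` byte ranges ending on '\n' boundaries.
proc chunkBounds(path: string; n: int): seq[Slice[int]] =
  var mf = memfiles.open(path)                  # mmap; no read pass first
  defer: mf.close()
  let data = cast[ptr UncheckedArray[char]](mf.mem)
  var start = 0
  for i in 1 .. n:
    var stop = if i == n: mf.size else: max(1, mf.size * i div n)
    while stop < mf.size and data[stop - 1] != '\n':
      inc stop                                  # nudge boundary to a newline
    if stop > start:
      result.add(start ..< stop)                # each slice goes to one worker
    start = stop

when isMainModule:
  for s in chunkBounds("data.txt", 4):          # "data.txt" is a stand-in path
    echo s.a, "..", s.b
```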

Profiling with Linux `perf` suggests the parsing is ~50% just `memchr` (AVX2 
vectorized). That is usually close to the best one can get, though in this case 
it is only about 4 GB/s (roughly 10% of my DIMM bandwidth).
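
To illustrate what that 50% means: the inner loop of this kind of parser is
essentially "find the next newline", which reduces to repeated `memchr` calls
over the mapped buffer. A rough sketch of that loop (names invented, not the
actual code from linect.nim or wf.nim):

```nim
# libc memchr; on x86-64 glibc this is the AVX2-vectorized scan seen in perf.
proc memchr(s: pointer; c: cint; n: csize_t): pointer
  {.importc, header: "<string.h>".}

proc countLines(buf: pointer; len: int): int =
  var off = 0
  while off < len:
    let p = memchr(cast[pointer](cast[int](buf) + off), cint(ord('\n')),
                   csize_t(len - off))
    if p == nil: break
    inc result
    off = cast[int](p) - cast[int](buf) + 1     # resume just past the newline
```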

It probably bears repeating here that pre-parsing into random-access binary 
formats pays bigger perf dividends when you re-analyze the same data, making 
"parsing performance" a less than perfect example. DB systems have done binary 
formats for 60 years with query languages "for non-programmers" (that "for" has 
always struck me as a real stretch.. LOL). For programmer consumers, you can get 
most/all of that perf with little work from something like 
[nio](https://github.com/c-blake/nio) or Vindaar's 
[nimHDF5](https://github.com/Vindaar/nimhdf5). The irregular record sizes of DNA 
strings (which began this discussion) do tend to limit those benefits.
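
As a toy illustration of that "parse once, re-analyze many times" payoff
(plain Nim, not nio's or nimHDF5's actual APIs; file names are made up):
convert a text column of numbers to a flat binary file up front, and later
runs just mmap it and index, with no text parsing at all.

```nim
import std/[memfiles, strutils]

# One-time conversion: text column of floats -> flat native binary file.
proc textToBin(txt, bin: string) =
  var f = open(bin, fmWrite)
  defer: f.close()
  for line in lines(txt):
    var x = parseFloat(line.strip)              # the only parsing pass, ever
    discard f.writeBuffer(addr x, sizeof(x))

# Re-analysis: mmap the binary column and compute directly on it.
proc meanFromBin(bin: string): float =
  var mf = memfiles.open(bin)
  defer: mf.close()
  let xs = cast[ptr UncheckedArray[float]](mf.mem)
  let n = mf.size div sizeof(float)
  for i in 0 ..< n: result += xs[i]
  if n > 0: result /= n.float

when isMainModule:
  textToBin("col.txt", "col.bin")               # stand-in file names
  echo meanFromBin("col.bin")
```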
