So, one other thing that is _probably_ obvious but bears mentioning just in
case it didn't occur to @alfrednewman - the number of addressable/seekable
bytes in a file is usually maintained by any modern filesystem on any modern OS
as cheaply accessed metadata. So, if what you really need is not an exact count
but a "reasonable bound" that could be violated once in a while, then you may be
able to _really_ accelerate your processing.
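For concreteness, here is a tiny sketch (the path is just a placeholder) of asking
the OS for that metadata from Nim; neither call reads any of the file's contents:

```nim
import os

# Both numbers come from stat()-style metadata, not from scanning the file.
echo getFileSize("input.txt")            # placeholder path
echo getFileInfo("input.txt").size       # same figure via getFileInfo
```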
As one example text corpus that we all have access to, the current Nim git
repo's `lib/nim/**.nim` files have an average line length of 33 bytes and a
standard deviation of about 22 bytes, as assessed by this little program:
```nim
import os, memfiles

when isMainModule:
  var inp = memfiles.open(paramStr(1))  # file to measure, given on the command line
  var counter = 0
  var sumSq = 0
  for slc in memSlices(inp):            # one slice per '\n'-delimited line
    inc(counter)
    sumSq += slc.size * slc.size
  echo "num lines: ", counter
  let mean = float(inp.size) / float(counter)
  echo "mean length: ", mean
  let meanSq = float(sumSq) / float(counter)
  echo "var(length): ", meanSq - mean*mean
  close(inp)
```
You could probably bound the average number of bytes per line from above as,
say, (average + 4*stdev), which in this case is about 33 + 22*4 =~ 121 bytes;
maybe round up to 128. Then you could do something like:
```nim
import os

let myPath = paramStr(1)   # whatever file you want the quick estimate for
# 128 =~ mean + 4*stdev bytes per line, from the corpus measured above
var reasonableBoundOnLineCount = int(float(getFileInfo(myPath).size) / 128.0)
```
If you use that bound to size an allocation then you are unlikely to be off from
what you actually need by more than about 4X, which isn't usually considered
"that bad" in this kind of problem space. The 4 standard deviation cushion is
what makes it very, very rare for the estimate to come out too high; depending
on what you are doing you can tune that parameter, and you might need to be
prepared in your code to "spill" past the estimate when it comes out low. This
optimization will beat the pants off even an AVX512 routine that iterates over
all the file bytes, at least for this aspect of the calculation. It basically
eliminates a whole pass over the input data in a case that is made common by
construction.
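To make the allocate-then-spill idea concrete, here is a small sketch of my own
(the proc `lineStarts` and its `bytesPerLineGuess` parameter are just
illustrative, not anything from the standard library): it pre-sizes a seq of
line-start offsets using only file-size metadata and lets the seq grow whenever
the guess comes up short:

```nim
import os, memfiles

proc lineStarts(path: string; bytesPerLineGuess = 128): seq[int] =
  ## Pre-size from metadata alone; `add` simply reallocates ("spills")
  ## if the estimate is too small. Assumes plain '\n' line endings.
  let hint = int(float(getFileInfo(path).size) / float(bytesPerLineGuess))
  result = newSeqOfCap[int](hint)
  var mf = memfiles.open(path)
  var off = 0
  for slc in memSlices(mf):
    result.add off
    off += slc.size + 1            # +1 for the '\n' delimiter
  close(mf)

echo lineStarts(paramStr(1)).len, " lines"
```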
Since you have seemed pretty dead set on an exact calculation in other posts, a
small elaboration upon the "embedded assumptions" in this optimization may be
warranted. All that is really relied upon is that some initial sample of files
predicts the distribution of line lengths "well enough" to pick a threshold
(that "128" divisor) whose mis-estimates are "tunably rare" - rare enough not to
cause much slowdown in whatever ultimate calculation you are actually doing,
which you have not been clear about.
Another idea along these lines, if, say, the files are processed over and over
again, is to avoid re-computing all those `memchr()` calls by writing a little
proc/program to maintain an off-to-the-side file of byte indexes to the
beginnings of lines. The idea here would be that you have two sets of files:
your actual input files and, paired with each input "foo", a "foo.idx"
containing just a bunch of binary ints in the native format of the CPU - either
byte offsets of line starts or line lengths - effectively caching the answers of
the `memchr` calls.
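A minimal sketch of the writer half of that idea (the name `buildIndex` and the
choice of int64 offsets are mine, just for illustration):

```nim
import memfiles

proc buildIndex(path: string) =
  ## Writes "path.idx": one native-format int64 byte offset per line of `path`.
  ## Assumes plain '\n' line endings.
  var mf = memfiles.open(path)
  var idx: File
  if not idx.open(path & ".idx", fmWrite):
    raise newException(IOError, "cannot create " & path & ".idx")
  var off: int64 = 0
  for slc in memSlices(mf):
    discard idx.writeBuffer(addr off, sizeof(off))
    off += int64(slc.size) + 1     # +1 for the '\n' delimiter
  close(idx)
  close(mf)
```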
If you had such index files then, when you want to know how many lines a file
has, you can `getFileInfo` on the ".idx" file and know immediately: its size
divided by the width of one int is the line count. You could be careful and
check modification times on the .idx and the original data file and that sort of
thing, too. Why, you could even add a "smarter" `memSlices` that checked for
such a file and skipped almost all of its work, and an API call `numSlices` that
skipped all of the work whenever the ".idx" file is up-to-date according to time
stamps.
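And a sketch of what that `numSlices` call could look like (hypothetical, not an
existing `memfiles` API): it trusts "foo.idx" only when its timestamp is at
least as new as the data file's and otherwise falls back to an actual scan:

```nim
import os, times, memfiles

proc numSlices(path: string): int =
  ## Hypothetical helper: O(1) line count from "path.idx" when it is fresh,
  ## else a normal memSlices scan over the data file.
  let idx = path & ".idx"
  if fileExists(idx) and
      getLastModificationTime(idx) >= getLastModificationTime(path):
    return int(getFileInfo(idx).size div 8)   # 8 bytes per int64 offset
  var mf = memfiles.open(path)
  for _ in memSlices(mf):
    inc result
  close(mf)
```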
Basically, except for actually "just computing file statistics", it seems
highly unlikely to me that you should really be optimizing the heck out of
newline counting in and of itself beyond what `memSlices/memchr()` already do.