On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
> I'm trying to read in a text file that has many duplicated lines and output a file with all the duplicates removed. By the end of this code snippet, the memory usage is ~5x the size of the infile (which can be multiple GB each), and when this is in a loop the memory usage becomes unmanageable and often results in an OutOfMemory error or just a complete lock up of the system. Is there a way to reduce the memory usage of this code without sacrificing speed to any noticeable extent? My assumption is the .sort.uniq needs improving, but I can't think of an easier/not much slower way of doing it.

I spent some time experimenting with this problem, and here is the best solution I found, assuming that perfect de-duplication is required. (I'll put the code up on GitHub / dub if anyone wants to have a look.)
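
For reference, since the snippet itself wasn't quoted, the simple all-in-RAM approach presumably looks something like the sketch below; the file names and exact calls are only a guess at its shape:

import std.algorithm : sort, uniq;
import std.array : array;
import std.stdio : File;

void main()
{
    // Load every line into memory, sort, and write out with adjacent
    // duplicates dropped. Memory use is a multiple of the input size.
    auto lines = File("infile.txt").byLineCopy.array;
    lines.sort();
    auto outFile = File("outfile.txt", "w");
    foreach (line; lines.uniq)
        outFile.writeln(line);
}

The steps below avoid ever materializing one giant sorted array like that: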

--------------------------
0) Memory-map the input file, so that the program can pass around slices of it directly without making copies. This also lets the OS page it in and out of physical memory for us, even if the file is too large to fit all at once. (A condensed D sketch of steps 0-3 follows this list.)

1) Pre-compute the required space for all large data structures, even if an additional pass is required to do so. This makes the rest of the algorithm significantly more efficient with memory, time, and lines of code.

2) Do a top-level bucket sort of the file into some scratch space, keyed on a small (8-16 bit) hash. The scratch space can live either in RAM, or in another memory-mapped file if we really need to minimize physical memory use.

The small hash can be a few bits taken off the top of a larger hash (I used std.digest.murmurhash). The larger hash is cached for use later on, to accelerate string comparisons, avoid unnecessary I/O, and perhaps do another level of bucket sort.

If there is too much data to fit in physical memory all at once, be sure to copy the full text of each line into a region of the scratch file where it sits alongside the other lines that share the same small hash. This is critical, as otherwise the string comparisons in the next step turn into slow random I/O.

3) For each bucket, sort, filter out duplicates, and write to the output file. Any sorting algorithm(s) may be used if all associated data fits in physical memory. If not, use a merge sort, whose access patterns won't thrash the disk too badly.

4) Manually release all large data structures, and delete the scratch file if one was used. This is not difficult to do, since their lifetimes are well defined, and it ensures that the program won't hang on to GiB of space any longer than necessary.
--------------------------
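
To make the plan above concrete, here is a heavily condensed D sketch of the in-RAM path (steps 0-3, plus the manual cleanup of step 4). It is not the ~800-line implementation described below: the 256-bucket split, the Entry layout, and the dedupFile name are illustrative choices, and the scratch-file spilling and per-bucket write buffering mentioned above and below are omitted.

import std.algorithm : count, sort, uniq;
import std.digest : digest;
import std.digest.murmurhash : MurmurHash3;
import std.mmfile : MmFile;
import std.stdio : File;

void dedupFile(string inPath, string outPath)
{
    // Step 0: memory-map the input so lines can be sliced without copying.
    auto mm = new MmFile(inPath);
    auto text = cast(const(char)[]) mm[];

    // Step 1: one cheap counting pass, so the buckets can be pre-sized.
    immutable lineCount = text.count('\n') + 1;

    struct Entry { ubyte[16] hash; const(char)[] line; }
    enum bucketCount = 256;                    // small 8-bit top-level hash
    auto buckets = new Entry[][bucketCount];
    foreach (ref b; buckets)
        b.reserve(lineCount / bucketCount + 1);

    // Step 2: bucket the lines on the first byte of a 128-bit MurmurHash3,
    // caching the full hash to speed up the comparisons in step 3.
    void addLine(const(char)[] line)
    {
        auto h = digest!(MurmurHash3!(128, 64))(line);
        buckets[h[0]] ~= Entry(h, line);
    }
    size_t start = 0;
    foreach (i, ch; text)
        if (ch == '\n')
        {
            addLine(text[start .. i]);
            start = i + 1;
        }
    if (start < text.length)
        addLine(text[start .. $]);             // last line, no trailing newline

    // Step 3: sort each bucket by (hash, text), drop adjacent duplicates, write.
    auto outFile = File(outPath, "w");
    foreach (ref b; buckets)
    {
        b.sort!((a, c) => a.hash < c.hash || (a.hash == c.hash && a.line < c.line));
        foreach (e; b.uniq!((a, c) => a.hash == c.hash && a.line == c.line))
            outFile.writeln(e.line);
    }

    // Step 4: drop the big structures and unmap the file as soon as we're done,
    // rather than waiting for the GC to get around to them.
    buckets = null;
    destroy(mm);
}

The (hash, text) ordering in step 3 is what the cached MurmurHash3 digest buys: most unequal lines are rejected after comparing 16 bytes, without touching the line text at all.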

I wrote an optimized implementation of this algorithm. It's fast, efficient, and really does work on files too large for physical memory. However, it is complicated, at almost 800 lines.

On files small enough to fit in RAM, it is about as fast as the other solutions posted, around 3 seconds per GiB on my desktop, but less memory-hungry. Memory consumption in this case is around (sourceFile.length + 32 * lineCount * 3 / 2) bytes.

When using a memory-mapped scratch file to accommodate huge files, the physical memory required is around max(largestBucket.data.length + 32 * largestBucket.lineCount * 3 / 2, bucketCount * writeBufferSize) bytes. (Virtual address space consumption is far higher, and the OS will commit however much physical memory is available and not needed by other tasks.) The run time is however long it takes the disk to read the source file twice, write a (sourceFile.length + 32 * lineCount * 3 / 2)-byte scratch file, read that scratch file back, and write the destination file.

I tried it with a 38.8 GiB, 380_000_000 line file on a magnetic hard drive. It needed a 50.2 GiB scratch file and took about an hour (after much optimization and many bug fixes).
