On Tuesday, May 15, 2018 20:36:21 Dennis via Digitalmars-d-learn wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb
> would fit but I get an out of memory error when using
> std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes in
> the middle of lines)
>
> I want to write a program that splits it up into multiple files,
> with the splits happening every n lines. I keep encountering
> roadblocks though:
>
> - You can't give Yes.useReplacementChar to `byLine` and `byLine`
> (or `readln`) throws an Exception upon encountering an invalid
> character.
> - decodeFront doesn't work on inputRanges like
> `byChunk(4096).joiner`
> - std.algorithm.splitter doesn't work on inputRanges either
> - When you convert chunks to arrays, you have the risk of a split
> being in the middle of a character with multiple code units
>
> Is there a simple way to do this?

If you're on a *nix system, and you're simply looking for a solution to split files and don't necessarily care about writing one, I'd suggest trying the split utility:

https://linux.die.net/man/1/split

If I had to write it in D, I'd probably just use std.mmfile and operate on the file as a dynamic array of ubytes, since if what you care about is '\n', that can easily be searched for without needing any decoding, and using mmap avoids having to chunk anything.

- Jonathan M Davis
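
A minimal sketch of the memory-mapped approach described above, assuming a 64-bit system (so a 1.5 GB mapping fits in the address space) and a non-empty input file. The function name `splitByLines`, the four-digit numeric suffix on output files, and the command-line layout are hypothetical choices for illustration, not anything from the original thread. It only counts '\n' bytes, so invalid Unicode and embedded null bytes pass through untouched.

```d
import std.conv : to;
import std.format : format;
import std.mmfile : MmFile;
import std.stdio : File, writeln;

/// Splits `inputPath` into files named outputPrefix ~ "0000", "0001", ...
/// with at most `linesPerFile` lines each. Returns the number of files written.
/// Assumption: the input file is non-empty (empty files are not handled here).
size_t splitByLines(string inputPath, size_t linesPerFile, string outputPrefix)
{
    assert(linesPerFile > 0, "linesPerFile must be positive");

    // Map the file read-only; the OS pages the contents in on demand.
    auto mmf = new MmFile(inputPath);
    scope (exit) destroy(mmf); // unmap deterministically when we are done

    // View the mapping as raw bytes, so no UTF decoding ever happens.
    auto data = cast(const(ubyte)[]) mmf[];

    size_t start = 0; // offset of the first byte not yet written out
    size_t lines = 0; // newlines seen since `start`
    size_t part = 0;  // index of the next output file

    // Writes data[start .. end] to the next output file and resets the counters.
    void flush(size_t end)
    {
        File(format("%s%04d", outputPrefix, part++), "wb").rawWrite(data[start .. end]);
        start = end;
        lines = 0;
    }

    foreach (i, b; data)
    {
        if (b == '\n' && ++lines == linesPerFile)
            flush(i + 1); // include the newline in the current part
    }

    // Trailing lines that did not fill a whole part.
    if (start < data.length)
        flush(data.length);

    return part;
}

void main(string[] args)
{
    if (args.length != 4)
    {
        writeln("usage: ", args[0], " <input> <lines-per-file> <output-prefix>");
        return;
    }
    writeln(splitByLines(args[1], args[2].to!size_t, args[3]), " files written");
}
```

Because the split points are always immediately after a '\n' byte, a part can never end in the middle of a multi-byte UTF-8 sequence, which sidesteps the chunk-boundary problem raised in the question.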