My experience making an on-disk merge sort using ranges

Chris Cain Tue, 26 Feb 2013 17:50:38 -0800

Greetings everyone,

First, let me apologize that this is quite a long post. I hope that it doesn't scare you off.

I wasn't exactly sure where to put this, because I intend both to learn more about D and its overall design and to let others know about various difficulties I've been experiencing while using it for various purposes, and to point out potential fixes. As such, I was torn between the D.Learn forum and the vastly more observed D forum. I hope you don't mind my eventual decision on the matter.

Let me say that I was inspired to try out the "Component Programming" aspect of D using its ranges because of the excellently done recent talk by Walter. As such, I'm mostly interested in perfecting this particular approach. Obviously, I have no doubt that I could program this in other ways and have no issues whatsoever.


--

So, with that said, let's get to my point:

My goal is to write an on-disk merge sort (nearly) exclusively using ranges. The part of the problem I'm highlighting in this post is the process of merging multiple sorted files together into one file (in this case, stdout). I claim that this code should be acceptable:


---

import std.stdio, std.file, std.algorithm, std.array, std.conv;

void main() {
        dirEntries("data", "*.out", SpanMode.shallow)
                .map!(e => File(e.name).byLine(KeepTerminator.yes)
                        .map!(l => l.to!string())()
                )()
                .array()
                .nWayUnion()
                .copy(stdout.lockingTextWriter);
}

---

I was certainly surprised to find out that nWayUnion existed in std.algorithm, which essentially does the work for me. However, I was more surprised to find out that this does not compile. So, who here thinks this _shouldn't_ work? Look at it pretty closely and, without investigating the error, justify to yourself why this shouldn't work before you continue on. I will explain how I perceive the problem.

I thought I was already way ahead of the game by using to!string on each line in byLine (I'm aware of its oddity of giving you the actual buffer it writes to). Honestly, I was a bit shocked to discover that such a simple approach won't work because moveFront doesn't work with MapResult. Keep in mind, there are two maps here, but the MapResult that causes a problem is the *inner* one (the one which is applied on each line of the file).

---

Okay, so with that out of the way, I'd like to know what the idiomatic way to solve this particular problem is. Here's what I did (I don't claim the ranges I created are robustly implemented):


https://gist.github.com/Zshazz/c488e70eee4fd352b789

The first thing I figured out was that if you turned the MapResult into an array (using .array()), then it would work as expected. However, this is obviously not an acceptable solution to my problem, because I'm doing an on-disk merge sort to sort things that wouldn't fit in RAM. So, finding this out, I realized that the (likely) problem with MapResult is that it's a value type and that this prevents moveFront from being applicable to it for some reason (unfortunately, I don't know what that reason is).
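To illustrate the in-memory case that does compile, here is a minimal sketch with string arrays standing in for the files. Two assumptions: `mergeMaterialised` is a hypothetical name, and the merge uses `multiwayMerge`, the name under which later Phobos releases provide what `nWayUnion` did, so on a 2013-era compiler the name would have to be swapped back.

```d
import std.algorithm : map, multiwayMerge;
import std.array : array;
import std.conv : to;

// Hypothetical helper: merges already-sorted runs that fit in memory.
string[] mergeMaterialised(string[][] runs)
{
    // Stand-in for the per-file pipeline: a lazy map over each run,
    // forced into an array so every inner range is a plain string[].
    auto inner = runs.map!(r => r.map!(l => l.to!string).array).array;

    // Merging a string[][] works; duplicates are kept.
    return inner.multiwayMerge.array;
}
```

The cost is exactly the one described above: every inner `.array` call holds a whole run in memory at once.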

My second solution was to write a wrapper for it that turns it into a reference type (just made a quick class that forwards calls to MapResult, which it holds a copy of) to test my theory. Sure enough, my wrapper worked as expected. However, I compared it to another program and found it to be much slower, so I assumed it might be because of the forwarding mechanism slowing it down.

My third solution was to write a reference-based map. I couldn't make a value type of map that would work because I didn't know what prevented it from playing nicely with moveFront. This approach was much faster and actually managed to match my hand-written merge code. I was pretty satisfied with that approach, but it bothered me that I was essentially duplicating the functionality provided by std.algorithm.

My final solution was born out of me spending more time looking at the reason why my refMap was faster. I discovered that when I implemented refMap, I cached the result of applying the function on front (which differs from the map implementation in std.algorithm). So, I decided to rewrite my original wrapper so that it caches the result. This provided me the performance benefits of my refMap, but I didn't feel guilty about duplicating existing functionality. So, my caching range is my favorite solution thus far. Still, I would much rather have had my original attempt work and a caching range be invented solely to improve performance, not just to get it to work at all.
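For readers who don't want to open the gist, here is a minimal sketch of what such a caching reference-type wrapper could look like. It is not the code from the gist: `CachingRange`, `cachingRange`, `countedDouble` and `callsWithCache` are hypothetical names, and the sketch assumes the element type is assignable (true for string and int).

```d
import std.algorithm : map;
import std.range : ElementType, isInputRange;

// Class (reference type) that wraps an input range and caches front.
final class CachingRange(R) if (isInputRange!R)
{
    private R source;
    private ElementType!R cached;
    private bool done;

    this(R source)
    {
        this.source = source;
        refill();
    }

    @property bool empty() { return done; }

    // Returns the stored value; the wrapped range's front is not re-evaluated.
    @property ElementType!R front() { return cached; }

    void popFront()
    {
        source.popFront();
        refill();
    }

    // Evaluates the wrapped range's front exactly once per element.
    private void refill()
    {
        done = source.empty;
        if (!done)
            cached = source.front;
    }
}

auto cachingRange(R)(R source) if (isInputRange!R)
{
    return new CachingRange!R(source);
}

// Illustration only: count how often the mapped function runs.
size_t doubleCalls;
int countedDouble(int x) { ++doubleCalls; return x * 2; }

size_t callsWithCache(int[] data)
{
    doubleCalls = 0;
    auto r = cachingRange(data.map!countedDouble);
    for (; !r.empty; r.popFront())
    {
        auto a = r.front; // both reads hit the cache
        auto b = r.front;
        assert(a == b);
    }
    return doubleCalls;
}
```

With the cache, the mapped function runs once per element no matter how often front is read, which is presumably where the speed-up comes from when the merge looks at each front several times.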

---

With all of that in mind, my question is this: Why doesn't MapResult work with moveFront? Also, if it cannot be made to work with moveFront as a value type, would it be a good idea to turn it into a reference type? Is there any way to make it transparent to the user so that they don't have to do this sort of investigation to solve what should be a fairly simple problem (merging multiple ranges together, in this case)?



Thank you for taking the time to read this,
Take care.
