Re: Range of chars (narrow string ranges)

Chris via Digitalmars-d Wed, 29 Apr 2015 03:06:08 -0700

On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:

On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
Would it be much work to have example code or even an experimental module that gets rid of auto-decoding, so we could see what would be affected in general and how actual code we have would be affected by it?
The topic keeps coming up again and again, and while I'm in favor of anything that enhances performance, I'm afraid of having to refactor large chunks of my code. This fear may be unfounded, but I would need some examples to visualize the problem.
Honestly, most code won't care. If we just switched out all of the auto-decoding right now, pretty much anything using only ASCII would just work, and most anything that's trying to manipulate ASCII characters in a Unicode string will just work, whereas code that's specifically manipulating Unicode characters might have problems (e.g. comparing front with a dchar will no longer have the same result, since front would just be the first code unit rather than necessarily the first code point). Since most Phobos range-based functions which operate on strings are special-cased on strings already, many of them would continue to just work (e.g. find returns the same range type as what's passed to it even if it's given a string, so it might just work with the change, or it might need to be tweaked slightly), and those that would need tweaking would then generally either need to call encode on an argument to make it match the string type in cases where string types mix (e.g. "foo".find("fo"d) would need to call encode on "fo"d to make it a string for comparison), or the caller would need to use std.utf.byDchar or std.uni.byGrapheme to operate on code points or graphemes rather than code units.
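A minimal sketch of the front difference described above, assuming a Phobos recent enough to provide std.utf.byCodeUnit. The helper names firstCodePoint and firstCodeUnit are hypothetical and exist only for illustration; byCodeUnit stands in here for what front might look like without auto-decoding.

```d
import std.range : ElementType, front;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Hypothetical helpers, named here only for illustration.
dchar firstCodePoint(string s) { return s.front; }           // auto-decoded front
char firstCodeUnit(string s) { return s.byCodeUnit.front; }  // raw code unit

void main()
{
    // U+00E9 is encoded in UTF-8 as the two code units 0xC3 0xA9.
    string s = "\u00E9a";

    // With auto-decoding, a string is a range of dchar.
    static assert(is(ElementType!string == dchar));
    // The code unit view is a range of (qualified) char.
    static assert(is(Unqual!(ElementType!(typeof(s.byCodeUnit))) == char));

    assert(firstCodePoint(s) == '\u00E9'); // whole code point
    assert(firstCodeUnit(s) == 0xC3);      // only the first code unit
    assert(firstCodeUnit(s) != '\u00E9');  // so this comparison changes
}
```

For a pure ASCII string the two helpers return the same value, which is why most code would not notice.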
The two biggest places in Phobos that would potentially have problems are functions that special-cased strings but still used front and those which have to return a new range type. filter would be a good example, because it's forced to return a new range type. Right now, it would filter on dchars, but with the change, it would filter on the code unit type (most typically char). If you're filtering on ASCII characters, it wouldn't matter aside from the fact that the resulting range would have an element type of char rather than dchar, but if you're filtering on Unicode characters, it wouldn't work anymore. For situations like that, you'd be forced to use std.utf.byDchar or std.uni.byGrapheme. However, since most string code tends to operate on substrings rather than characters, I don't know how common it even is to use a function like filter on a string (as opposed to a range of strings). Such code might actually be fairly rare.
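To make the filter case concrete, here is a rough sketch that filters the same string once by code point and once by code unit. The function names countByCodePoint and countByCodeUnit are made up for this example, and byCodeUnit is only an approximation of how a string might behave as a range without auto-decoding.

```d
import std.algorithm : filter;
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: filters on code points (today's behavior for strings).
size_t countByCodePoint(string s, dchar target)
{
    return s.byDchar.filter!(c => c == target).walkLength;
}

// Hypothetical helper: filters on code units (behavior without auto-decoding).
size_t countByCodeUnit(string s, dchar target)
{
    return s.byCodeUnit.filter!(c => c == target).walkLength;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    // Filtering on an ASCII character gives the same answer either way.
    assert(countByCodePoint(s, 'l') == 2);
    assert(countByCodeUnit(s, 'l') == 2);

    // Filtering on a non-ASCII character only works on code points.
    assert(countByCodePoint(s, '\u00E9') == 1);
    assert(countByCodeUnit(s, '\u00E9') == 0);
}
```

The last assertion is the silent failure mode: nothing breaks at compile time, the filter just never matches.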
So, there _are_ a few functions which stop working the same way in a potentially silent manner if we just made it so that front didn't autodecode anymore. However, in general, because Phobos almost always special-cases strings, calls to Phobos functions probably wouldn't need to change in most cases, and when they do, a call to byDchar would restore the old behavior. But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code, even though in most cases, it wouldn't matter, because most code will be operating on ASCII strings even if the strings themselves contain Unicode - e.g. unicodeString.find(asciiString) is far more common than unicodeString.find(otherUnicodeString).
I suspect that the code that's at the greatest risk is code that checks for is(Unqual!(ElementType!Range) == dchar) to operate on strings and wrapper ranges around strings, since it would then only match the cases where byDchar had been used. In general though, the code that's going to run into the most trouble is user code that contains range-based functions similar to what you might find in Phobos rather than code that's simply using the Phobos functions like startsWith and find - i.e. if you're writing range-based code that worries about doing stuff like special-casing strings or which specifically needs to operate on code points, then you're going to have to make changes, whereas to a great extent, if all you're doing is passing strings to Phobos functions, your code will tend to just work.
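As an illustration of the kind of check at risk, the sketch below defines two hypothetical traits: isDcharRange mirrors the is(Unqual!(ElementType!Range) == dchar) test, and isCharRange is one possible width-agnostic alternative (an assumption of this example, not something Phobos defines under that name).

```d
import std.range : ElementType, isInputRange;
import std.traits : Unqual, isSomeChar;
import std.utf : byCodeUnit, byDchar;

// Hypothetical trait mirroring the check discussed above.
enum isDcharRange(R) =
    isInputRange!R && is(Unqual!(ElementType!R) == dchar);

// Hypothetical width-agnostic alternative: any range of char, wchar or dchar.
enum isCharRange(R) =
    isInputRange!R && isSomeChar!(ElementType!R);

// Today a plain string matches only because front auto-decodes to dchar.
static assert(isDcharRange!string);
// An explicit byDchar wrapper keeps matching either way.
static assert(isDcharRange!(typeof("".byDchar)));
// A code unit range, standing in for a non-decoding string, does not match.
static assert(!isDcharRange!(typeof("".byCodeUnit)));
// The width-agnostic check accepts all three.
static assert(isCharRange!string);
static assert(isCharRange!(typeof("".byDchar)));
static assert(isCharRange!(typeof("".byCodeUnit)));

void main() {}
```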
To actually see what the impact would be, we'd have to just change Phobos, I think, and then see what the impact was on user code. It could be surprising how much or how little it affects things, though in most cases, I expect that it'll mean that code will just work. And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got. In many cases, if we simply made sure that Phobos functions which special-cased strings didn't use front directly but instead didn't care whether they were operating on ranges of char, wchar, or dchar, then we wouldn't even need to version anything (e.g. find could easily be made to work that way if it doesn't already), but some functions (like filter) would need to be versioned differently.
So, maybe what we need to do to start is to just go through Phobos and make as many functions as possible not care about whether they're dealing with strings as ranges of char, wchar, or dchar. And at least then, we'd minimize how much code would have to be versioned differently if we were to test out getting rid of autodecoding with versioning.
- Jonathan M Davis

This sounds like a good starting point for a transition plan. One important thing, though, would be to do some benchmarking with and without autodecoding, to see if it really boosts performance in a way that would justify the transition.
