On Monday, May 21, 2018 15:00:09 Dennis via Digitalmars-d-learn wrote:
> On Thursday, 17 May 2018 at 21:10:35 UTC, Dennis wrote:
> > It's unfortunate that Phobos tells you 'there's problems with
> > the encoding' without providing any means to fix it or even
> > diagnose it.
>
> I have to take that back since I found out about std.encoding,
> which has functions like `sanitize`, but also `transcode`. (My
> file turned out to actually be encoded with ANSI / Windows-1252,
> not UTF-8.)
> Documentation is scarce, however, and it requires strings instead
> of forward ranges.
>
> @Jon Degenhardt
>
> > Instead of:
> >     auto outputFile = new File("output.txt");
> >
> > try:
> >     auto outputFile = File("output.txt", "w");
>
> Wow, I really butchered that code. So it is the `drop(4)` that
> triggers the UTFException?
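Pretty much, yes. A minimal sketch of what's going on (the string and indices here are just made up for illustration):

```d
import std.range : drop;
import std.utf : byCodeUnit;

void main()
{
    string s = "hi\xFFhi"; // 0xFF can never occur in well-formed UTF-8

    // Range primitives auto-decode plain strings, so something like
    //     s.drop(3)
    // can throw a UTFException as soon as iteration reaches the bad byte.

    // byCodeUnit iterates char by char with no decoding, so this works:
    auto tail = s.byCodeUnit.drop(3);
    assert(tail.length == 2);
    assert(tail[0] == 'h' && tail[1] == 'i');
}
```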
drop is range-based, so if you give it a string, it's going to decode, because of the whole auto-decoding mess with std.range.primitives.front and popFront. If you can't have auto-decoding, you either have to be dealing with functions that you know avoid it, or you need to use something like std.string.representation or std.utf.byCodeUnit to get around it. If you're dealing with invalid Unicode, you basically have to either convert it all up front or treat it like binary data; otherwise, Phobos is going to try to decode it as Unicode and give you a UTFException.

> I find Exceptions in range code hard to interpret.

Well, if you just look at the stack trace, it should tell you. I don't see why ranges would be any worse than any other code, except that it's typical to chain a lot of calls, and you frequently end up with wrapper types in the stack trace that you're not necessarily familiar with.

The big problem here is really that all you're being told is that your string has invalid Unicode in it somewhere, along with the chain of function calls that resulted in std.utf.decode being called on your invalid Unicode. But even if you weren't dealing with ranges, if you passed invalid Unicode to something completely string-based which did decoding, you'd run into pretty much the same problem. The data is being used outside of its original context, where you could easily figure out what it relates to, so it's going to be a problem by its very nature. The only real solution is to be controlling the decoding yourself, and even then, it's easy to be in a position where it's hard to figure out where in the data the bad data is, unless you've done something like keep track of exactly what index you're at, which really doesn't work well once you're dealing with slicing data.

> @Kagamin
>
> > Do it old school?
> I want to be convinced that Range programming works like a charm,
> but the procedural approaches remain more flexible (and faster,
> too) it seems. Thanks for the example.

The whole auto-decoding mess makes things worse than they should be, but if you find procedural approaches more flexible, then I would guess that that's simply a matter of getting more experience with ranges. Ranges are far more composable in how they're used, which tends to inherently make them more flexible. However, the result is code that's a mixture of functional and procedural programming, which can be quite a shift for some folks. So, there's no question that it takes some getting used to, but D does allow for the more classic approaches, and ranges are not always the best approach.

As for performance, that depends on the code and the compiler. It wouldn't surprise me if dmd didn't optimize out the range stuff as much as it really should, but it's my understanding that ldc typically manages to generate code where the range abstraction doesn't cost you anything. When there is an issue, I think it's frequently an algorithmic one, or the fact that some range-processing has a tendency to process the same data multiple times, because that's the easiest, most abstract way to go about it; it works in general but isn't always the best solution. For instance, because of how the range API works, when using splitter, if you iterate through the entire range, you pretty much have to iterate through it twice: it does look-ahead to find the delimiter and then returns you a slice up to that point, after which you process that chunk of the data to do whatever it is you want to do with each split piece.
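To make that concrete, here's a minimal sketch of the kind of splitter chain I mean (the input data is just an example):

```d
import std.algorithm : map, splitter, sum;
import std.conv : to;

void main()
{
    auto line = "1,2,3,4";

    // splitter scans ahead to each ',' to find the slice boundaries (one
    // pass over the data), and map/to then walk each returned slice again
    // to do the actual work (a second pass).
    auto total = line.splitter(',').map!(to!int).sum;
    assert(total == 10);
}
```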
At a conceptual level, what you're doing with splitter is really clean and easy to write, and often it should be plenty efficient, but it does require going over the data twice, whereas if you looped over the data yourself, looking for each delimiter, you'd only need to iterate over it once. So, in cases like that, I'd fully expect the abstraction to cost you, though whether it costs enough to matter depends on what you're doing.

As is the case with most abstractions, I think it's mostly a matter of using ranges where they make sense to write cleaner code more quickly, and then later figuring out the hot spots where you need to optimize. In many cases, ranges will perform pretty much the same as hand-written loops, and in others, the abstraction is worth the cost. Where it isn't, you don't use them, or you implement something yourself rather than using the standard function, because you can write something faster for your use case. Just the other day, I refactored some code to not use splitter, because in that particular case it was costing too much, but there are still tons of cases where I'd use splitter without thinking twice about it, because it's the simplest, fastest way to get the job done, and it's going to be fast enough in most cases.

- Jonathan M Davis