Re: On string data types
2016-06-07 13:10 GMT+02:00 'Brian Foley' via ganeti-devel < ganeti-devel@googlegroups.com>: > Hi Iustin, > > I'm half inclined to think that if we're going to change the type of UUID > we should do it 'properly', ie change to a strict, unpacked 16-byte array > that implements Eq, Show, Read (and maybe Hashable too) and use it > consistently everywhere. > 16-byte - does that mean you favour Data.Text over ByteString? I was thinking that way as well (due to ease of using one underlying type throughout), but we will pay 2N compared to bytes on disk. And at least for UUIDs, we don't do any string-related processing on them (substrings, casing, etc.) so we could save some memory there. On the other hand, Data.ByteString.Short is a newer addition, so using it might have some small logistical drawbacks, plus that (very rarely) we do have conversions. Hmm… I'd also be tempted to introduce a type that lets us distinguish between > UUIDs for different Ganeti object types, eg Disks, Nodes, Instances, NICs > etc. > Interesting. On one hand, they are the same thing (UUIDs), and we do process them together at one point (although badly - in getAllIDs). On the other hand, they do refer to different things :) Were you thinking of documentation only (type aliases), or hard separation (newtypes)? Efficiency-wise, yeah, we do have the problem where we don't have a suite > of microbenchmarks at the moment, and we don't really know which functions > are critical to overall performance. As you say, it's entirely possible > that improving performance in one area might just cause problems somewhere > else. > > Our testing approach at the moment is to take a relatively heavy workload > -- a parallel 4 node evacuate on a 80 node/1200 instance cluster -- and to > see what problems that exposes, but that's far from perfect. > > For what it's worth, I found that getNodeInstances shows up high in the > profile when doing a RAPI /2/nodes requests, and it would also matter for > ReqNodeInstances queries to confd. Otherwise it doesn't seem to be > something that's heavily used. > Ack to the above. I think I agree with you: we shouldn't make any large-scale changes on > anything but master because there's too much potential to introduce > performance regressions and we don't have a good way to measure them. > Instead, we're focusing on point fixes, and trying to get 2.16 released, > since it has a bunch of random efficiency improvements relative to 2.15, > improves KVM support, and allows configurable SSH key types/sizes. > Cool. Let's see where the discussion regarding Data.Text goes (anybody else wants to chime in), and we can proceed (in master) along those lines. thanks for the feedback! iustin
Re: On string data types
Hi Iustin, I'm half inclined to think that if we're going to change the type of UUID we should do it 'properly', ie change to a strict, unpacked 16-byte array that implements Eq, Show, Read (and maybe Hashable too) and use it consistently everywhere. I'd also be tempted to introduce a type that lets us distinguish between UUIDs for different Ganeti object types, eg Disks, Nodes, Instances, NICs etc. Efficiency-wise, yeah, we do have the problem where we don't have a suite of microbenchmarks at the moment, and we don't really know which functions are critical to overall performance. As you say, it's entirely possible that improving performance in one area might just cause problems somewhere else. Our testing approach at the moment is to take a relatively heavy workload -- a parallel 4 node evacuate on a 80 node/1200 instance cluster -- and to see what problems that exposes, but that's far from perfect. For what it's worth, I found that getNodeInstances shows up high in the profile when doing a RAPI /2/nodes requests, and it would also matter for ReqNodeInstances queries to confd. Otherwise it doesn't seem to be something that's heavily used. I think I agree with you: we shouldn't make any large-scale changes on anything but master because there's too much potential to introduce performance regressions and we don't have a good way to measure them. Instead, we're focusing on point fixes, and trying to get 2.16 released, since it has a bunch of random efficiency improvements relative to 2.15, improves KVM support, and allows configurable SSH key types/sizes. Cheers, Brian. On 5 June 2016 at 09:32, Iustin Popwrote: > Hi all, > > As discussed previously, I took a look at the feasibility of changing > UUID references to bytestrings or short bytestrings. While doable, > there are two problems with this. > > The first is that we don't have performance/load benchmarks that would > tell us how the overall behaviour changes after such a change. I've > tried to look at the unittests (test/hs/htest) as a proxy, but it > doesn't give any useful measures (± a few percents). The individual > benchmarks (e.g. getNodeInstances) do show significant improvement, but > how relevant are these? > > The second one is that there are two kinds of inefficiencies: basic > costs (which are higher in terms of memory for String versus something > else) and conversion costs (e.g. needing to go back to String from > ByteString). While basic costs are/can be high, a conversion cost in a > tight loop can be even higher (hence why getNodeInstances is very slow > right now). So whatever data type change we do, we should make sure > that the switch is complete, so that at runtime we don't have to pay any > conversion costs (as much as possible). > > Given both of these, I think that fixing stable 2.16 is risky; I think a > better approach would be to forego any current release (2.17 seems to be > in beta and has a stable branch, so not feasible either), and focus on > large scale changes in master: > > - switch for parsing from String+JSON to ByteString+Aeson; this should > give nontrivial speed and memory improvements > - decide on whether to use a single data type for both UUIDs and object > names (e.g. Text) or use a split model (ByteStrings vs. Text or > ShortByteString vs. Text) > - convert all object fields according to the above > - convert all internal data paths to not use String anymore > > This would be orthogonal to any algorithmic changes (e.g. hash consing > or similar), which are needed for overall memory use (whereas string > type changes would be useful for localised memory usage and lower cpu > usage due to less conversions). > > What do you think? > > regards, > iustin >
On string data types
Hi all, As discussed previously, I took a look at the feasibility of changing UUID references to bytestrings or short bytestrings. While doable, there are two problems with this. The first is that we don't have performance/load benchmarks that would tell us how the overall behaviour changes after such a change. I've tried to look at the unittests (test/hs/htest) as a proxy, but it doesn't give any useful measures (± a few percents). The individual benchmarks (e.g. getNodeInstances) do show significant improvement, but how relevant are these? The second one is that there are two kinds of inefficiencies: basic costs (which are higher in terms of memory for String versus something else) and conversion costs (e.g. needing to go back to String from ByteString). While basic costs are/can be high, a conversion cost in a tight loop can be even higher (hence why getNodeInstances is very slow right now). So whatever data type change we do, we should make sure that the switch is complete, so that at runtime we don't have to pay any conversion costs (as much as possible). Given both of these, I think that fixing stable 2.16 is risky; I think a better approach would be to forego any current release (2.17 seems to be in beta and has a stable branch, so not feasible either), and focus on large scale changes in master: - switch for parsing from String+JSON to ByteString+Aeson; this should give nontrivial speed and memory improvements - decide on whether to use a single data type for both UUIDs and object names (e.g. Text) or use a split model (ByteStrings vs. Text or ShortByteString vs. Text) - convert all object fields according to the above - convert all internal data paths to not use String anymore This would be orthogonal to any algorithmic changes (e.g. hash consing or similar), which are needed for overall memory use (whereas string type changes would be useful for localised memory usage and lower cpu usage due to less conversions). What do you think? regards, iustin