Re: On string data types

2016-06-08 Thread 'Iustin Pop' via ganeti-devel
2016-06-07 13:10 GMT+02:00 'Brian Foley' via ganeti-devel <
ganeti-devel@googlegroups.com>:

> Hi Iustin,
>
> I'm half inclined to think that if we're going to change the type of UUID
> we should do it 'properly', ie change to a strict, unpacked 16-byte array
> that implements Eq, Show, Read (and maybe Hashable too) and use it
> consistently everywhere.
>

16-byte - does that mean you favour Data.Text over ByteString? I was
thinking that way as well (due to ease of using one underlying type
throughout), but we will pay 2N compared to bytes on disk. And at least for
UUIDs, we don't do any string-related processing on them (substrings,
casing, etc.) so we could save some memory there.

On the other hand, Data.ByteString.Short is a newer addition, so using it
might have some small logistical drawbacks, plus that (very rarely) we do
have conversions.

Hmm…

I'd also be tempted to introduce a type that lets us distinguish between
> UUIDs for different Ganeti object types, eg Disks, Nodes, Instances, NICs
> etc.
>

Interesting. On one hand, they are the same thing (UUIDs), and we do
process them together at one point (although badly - in getAllIDs). On the
other hand, they do refer to different things :)

Were you thinking of documentation only (type aliases), or hard separation
(newtypes)?

Efficiency-wise, yeah, we do have the problem where we don't have a suite
> of microbenchmarks at the moment, and we don't really know which functions
> are critical to overall performance. As you say, it's entirely possible
> that improving performance in one area might just cause problems somewhere
> else.
>
> Our testing approach at the moment is to take a relatively heavy workload
> -- a parallel 4 node evacuate on a 80 node/1200 instance cluster -- and to
> see what problems that exposes, but that's far from perfect.
>
> For what it's worth, I found that getNodeInstances shows up high in the
> profile when doing a RAPI /2/nodes requests, and it would also matter for
>  ReqNodeInstances queries to confd. Otherwise it doesn't seem to be
> something that's heavily used.
>

Ack to the above.

I think I agree with you: we shouldn't make any large-scale changes on
> anything but master because there's too much potential to introduce
> performance regressions and we don't have a good way to measure them.
> Instead, we're focusing on point fixes, and trying to get 2.16 released,
> since it has a bunch of random efficiency improvements relative to 2.15,
> improves KVM support, and allows configurable SSH key types/sizes.
>

Cool. Let's see where the discussion regarding Data.Text goes (anybody else
wants to chime in), and we can proceed (in master) along those lines.

thanks for the feedback!
iustin


Re: On string data types

2016-06-07 Thread 'Brian Foley' via ganeti-devel
Hi Iustin,

I'm half inclined to think that if we're going to change the type of UUID
we should do it 'properly', ie change to a strict, unpacked 16-byte array
that implements Eq, Show, Read (and maybe Hashable too) and use it
consistently everywhere.

I'd also be tempted to introduce a type that lets us distinguish between
UUIDs for different Ganeti object types, eg Disks, Nodes, Instances, NICs
etc.

Efficiency-wise, yeah, we do have the problem where we don't have a suite
of microbenchmarks at the moment, and we don't really know which functions
are critical to overall performance. As you say, it's entirely possible
that improving performance in one area might just cause problems somewhere
else.

Our testing approach at the moment is to take a relatively heavy workload
-- a parallel 4 node evacuate on a 80 node/1200 instance cluster -- and to
see what problems that exposes, but that's far from perfect.

For what it's worth, I found that getNodeInstances shows up high in the
profile when doing a RAPI /2/nodes requests, and it would also matter for
 ReqNodeInstances queries to confd. Otherwise it doesn't seem to be
something that's heavily used.

I think I agree with you: we shouldn't make any large-scale changes on
anything but master because there's too much potential to introduce
performance regressions and we don't have a good way to measure them.
Instead, we're focusing on point fixes, and trying to get 2.16 released,
since it has a bunch of random efficiency improvements relative to 2.15,
improves KVM support, and allows configurable SSH key types/sizes.

Cheers,
Brian.

On 5 June 2016 at 09:32, Iustin Pop  wrote:

> Hi all,
>
> As discussed previously, I took a look at the feasibility of changing
> UUID references to bytestrings or short bytestrings.  While doable,
> there are two problems with this.
>
> The first is that we don't have performance/load benchmarks that would
> tell us how the overall behaviour changes after such a change.  I've
> tried to look at the unittests (test/hs/htest) as a proxy, but it
> doesn't give any useful measures (± a few percents).  The individual
> benchmarks (e.g. getNodeInstances) do show significant improvement, but
> how relevant are these?
>
> The second one is that there are two kinds of inefficiencies: basic
> costs (which are higher in terms of memory for String versus something
> else) and conversion costs (e.g. needing to go back to String from
> ByteString).  While basic costs are/can be high, a conversion cost in a
> tight loop can be even higher (hence why getNodeInstances is very slow
> right now).  So whatever data type change we do, we should make sure
> that the switch is complete, so that at runtime we don't have to pay any
> conversion costs (as much as possible).
>
> Given both of these, I think that fixing stable 2.16 is risky; I think a
> better approach would be to forego any current release (2.17 seems to be
> in beta and has a stable branch, so not feasible either), and focus on
> large scale changes in master:
>
> - switch for parsing from String+JSON to ByteString+Aeson; this should
>   give nontrivial speed and memory improvements
> - decide on whether to use a single data type for both UUIDs and object
>   names (e.g. Text) or use a split model (ByteStrings vs. Text or
>   ShortByteString vs. Text)
> - convert all object fields according to the above
> - convert all internal data paths to not use String anymore
>
> This would be orthogonal to any algorithmic changes (e.g. hash consing
> or similar), which are needed for overall memory use (whereas string
> type changes would be useful for localised memory usage and lower cpu
> usage due to less conversions).
>
> What do you think?
>
> regards,
> iustin
>


On string data types

2016-06-05 Thread Iustin Pop
Hi all,

As discussed previously, I took a look at the feasibility of changing
UUID references to bytestrings or short bytestrings.  While doable,
there are two problems with this.

The first is that we don't have performance/load benchmarks that would
tell us how the overall behaviour changes after such a change.  I've
tried to look at the unittests (test/hs/htest) as a proxy, but it
doesn't give any useful measures (± a few percents).  The individual
benchmarks (e.g. getNodeInstances) do show significant improvement, but
how relevant are these?

The second one is that there are two kinds of inefficiencies: basic
costs (which are higher in terms of memory for String versus something
else) and conversion costs (e.g. needing to go back to String from
ByteString).  While basic costs are/can be high, a conversion cost in a
tight loop can be even higher (hence why getNodeInstances is very slow
right now).  So whatever data type change we do, we should make sure
that the switch is complete, so that at runtime we don't have to pay any
conversion costs (as much as possible).

Given both of these, I think that fixing stable 2.16 is risky; I think a
better approach would be to forego any current release (2.17 seems to be
in beta and has a stable branch, so not feasible either), and focus on
large scale changes in master:

- switch for parsing from String+JSON to ByteString+Aeson; this should
  give nontrivial speed and memory improvements
- decide on whether to use a single data type for both UUIDs and object
  names (e.g. Text) or use a split model (ByteStrings vs. Text or
  ShortByteString vs. Text)
- convert all object fields according to the above
- convert all internal data paths to not use String anymore

This would be orthogonal to any algorithmic changes (e.g. hash consing
or similar), which are needed for overall memory use (whereas string
type changes would be useful for localised memory usage and lower cpu
usage due to less conversions).

What do you think?

regards,
iustin