Re: hail algorithm clarification

Iustin Pop Fri, 07 Sep 2012 13:36:59 -0700

On Fri, Sep 07, 2012 at 10:55:51AM +0300, Constantinos Venetsanopoulos wrote:
> On 09/07/2012 04:02 AM, Iustin Pop wrote:
> >On Thu, Sep 06, 2012 at 02:07:50PM +0300, Constantinos Venetsanopoulos wrote:
> >>On 09/05/2012 06:09 PM, Iustin Pop wrote:
> >>>On Wed, Sep 05, 2012 at 12:54:30PM +0300, Constantinos Venetsanopoulos wrote:
> >>>>Hello iustin,
> >>>>
> >>>>any news on that? I assume you are quite busy these days..
> >>>Not so much busy as overloaded with many small things, so I forgot about
> >>>this.
> >>>
> >>>Looking now at the LOCAL.data you provided me, I see that running hinfo
> >>>on it shows the known issue with KVM memory reporting:
> >>Can you point me to the corresponding thread for this problem,
> >>because I think I have missed that..


It was an older thread; it's linked from
http://code.google.com/p/ganeti/issues/detail?id=127.

> >>>Cluster status:
> >>>  F Name                 t_mem n_mem i_mem  x_mem  f_mem r_mem t_dsk f_dsk pcpu vcpu
> >>>  - demo1.dev.grnet.gr       0     0     0      0      0     0     0     0    0    0
> >>>  - demo2.dev.grnet.gr       0     0     0      0      0     0     0     0    0    0
> >>>  - demo3.dev.grnet.gr       0     0     0      0      0     0     0     0    0    0
> >>>  - demo4.dev.grnet.gr       0     0     0      0      0     0     0     0    0    0
> >>>    demo5.dev.grnet.gr  193811   429     0    310 193072     0  3800  3800   24    0
> >>>    demo6.dev.grnet.gr  193811   840     0    -97 193068  1024  3800  3780   24    0
> >>>    demo7.dev.grnet.gr  193811   836     0    -95 193070     0  3800  3800   24    0
> >>>    demo8.dev.grnet.gr  193811   381     0    395 193035     0  3800  3800   24    0
> >>>    demo9.dev.grnet.gr  193811  1630  1024   -830 191987     0  3800  3780   24    1
> >>>    demo10.dev.grnet.gr 193811  1257  1024  -1100 192630     0  3800  3780   24    1
> >>>    demo11.dev.grnet.gr 193811  8014 22528 -24579 187848  1024  3800  3640   24   22
> >>>    demo12.dev.grnet.gr 193811   364     0    379 193068  1024  3800  3780   24    0
> >>>    demo13.dev.grnet.gr 193811  1020  1024   -878 192645     0  3800  3780   24    1
> >>>
> >>>It could be that having negative x_mem throws the statistics badly
> >>>off-track, but I'm not entirely sure.
> >>If this is the case, wouldn't it affect also the drbd instances too?
> >>
> >>>Oh oh, I think I know. This is debug output from an instrumented binary.
> >>>I'm using 'plain', by the way:
> >>>
> >>>"For new-0 new primary demo5.dev.grnet.gr, score: 11.598841466014958"
> >>>"For new-0 new primary demo6.dev.grnet.gr, score: 11.598841466014958"
> >>>"For new-0 new primary demo7.dev.grnet.gr, score: 11.598841466014958"
> >>>"For new-0 new primary demo8.dev.grnet.gr, score: 11.598841466014958"
> >>>"For new-0 new primary demo9.dev.grnet.gr, score: 11.598841466014958"
> >>>"For new-0 new primary demo10.dev.grnet.gr, score: 11.598841466014958"
> >>>"For new-0 new primary demo11.dev.grnet.gr, score: 11.598841466014958"
> >>>"For new-0 new primary demo12.dev.grnet.gr, score: 11.598841466014958"
> >>>"For new-0 new primary demo13.dev.grnet.gr, score: 11.598841466014958"
> >>>
> >>>Note that all scores were identical, and we took the last one
> >>>arbitrarily.
> >>>
> >>>"For new-1 new primary demo5.dev.grnet.gr, score: 11.603271353529022"
> >>>"For new-1 new primary demo6.dev.grnet.gr, score: 11.603271353529022"
> >>>"For new-1 new primary demo7.dev.grnet.gr, score: 11.603271353529022"
> >>>"For new-1 new primary demo8.dev.grnet.gr, score: 11.603271353529022"
> >>>"For new-1 new primary demo9.dev.grnet.gr, score: 11.603271353529022"
> >>>"For new-1 new primary demo10.dev.grnet.gr, score: 11.603271353529022"
> >>>"For new-1 new primary demo11.dev.grnet.gr, score: 11.603271353529022"
> >>>"For new-1 new primary demo12.dev.grnet.gr, score: 11.603271353529022"
> >>>"For new-1 new primary demo13.dev.grnet.gr, score: 11.603271353529022"
> >>>
> >>>Again the same situation, and again and again:
> >>>
> >>>"For new-11 new primary demo5.dev.grnet.gr, score: 17.804894914109706"
> >>>"For new-11 new primary demo6.dev.grnet.gr, score: 17.804894914109706"
> >>>"For new-11 new primary demo7.dev.grnet.gr, score: 17.804894914109706"
> >>>"For new-11 new primary demo8.dev.grnet.gr, score: 17.804894914109706"
> >>>"For new-11 new primary demo9.dev.grnet.gr, score: 17.804894914109706"
> >>>"For new-11 new primary demo10.dev.grnet.gr, score: 17.804894914109706"
> >>>"For new-11 new primary demo11.dev.grnet.gr, score: 17.804894914109706"
> >>>"For new-11 new primary demo12.dev.grnet.gr, score: 17.804894914109706"
> >>>"For new-11 new primary demo13.dev.grnet.gr, score: 17.804894914109706"
> >>>
> >>>So what is happening here is that the cluster is reasonably balanced,
> >>>and adding an instance results in the same score, anywhere we place it.
> >>>That's expected for instance "new-0". But after new-0, it shouldn't be
> >>>the same, so I suspect a bug in allocation of instances without
> >>>secondaries… but I don't see anything obvious when looking at the code.
> >>If I understand correctly, you feed hail with the cluster info and
> >>simulate the creation of an instance, so it can provide output?
> >>If yes, when you add "new-0":
> >>
> >>  - is the cluster empty of instances? or
> >>  - it already has the instances found in my LOCAL.data?
> >>
> >>In case of the first, the output seems reasonable when adding new-0.
> >>In case of the latter, shouldn't the scores be different also for new-0?
> >>
> >>In either case, after the addition of new-0 the score should be
> >>different indeed. So it seems we have a problem there.
> >>
> >>>I will investigate further, but it could be a number of things:
> >>>
> >>>- a bug in the update of node list, which would make the most sense (but
> >>>   I don't see one)
> >>>- accumulated rounding errors, but the disk/memory values are sane
> >>>- something else
> >>>
> >>>I have to say, this is mighty interesting :)
> >>More investigation it is then...
> >>OK. I'll try to dive into a bit of Haskell too, but I'm not
> >>making any promises yet :)
> >found problem. embarrassingly trivial, unittests designed to catch it
> >failed.
> 
> :) Great! Good thing we found that then.
> 
> Thanks a lot for investigating this,

Not a problem. I was very surprised at the behaviour, and I honestly
believed that our unittests were actually testing for this.

In reality they were only testing for "allocate 1 instance in the right
place", but on an empty cluster, adding one instance anywhere will be
correct, so that test was not really meaningful. I've sent patches to
extend it to test the score computed in this process (which would have
caught this issue), and to test multiple instance allocation.
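
To make the difference between the two kinds of test concrete, here is a
toy sketch in Python. It is not the htools code (which is Haskell): the
balance metric (standard deviation of per-node memory utilisation) and the
names cluster_score, allocate and allocate_many are assumptions invented
for this illustration only.

```python
from statistics import pstdev


def cluster_score(used, total):
    """Toy balance metric: std-dev of per-node memory utilisation."""
    return pstdev([u / total for u in used])


def allocate(used, total, size):
    """Score the instance on every node and place it on the best one.

    Returns (best_index, scores, new_used). Ties go to the lowest index.
    """
    scores = []
    for idx in range(len(used)):
        trial = list(used)
        trial[idx] += size
        scores.append(cluster_score(trial, total))
    best = min(range(len(used)), key=lambda i: scores[i])
    new_used = list(used)
    new_used[best] += size
    return best, scores, new_used


def allocate_many(used, total, size, count):
    """Allocate `count` instances, carrying the updated node list forward."""
    placements = []
    for _ in range(count):
        best, scores, used = allocate(used, total, size)
        placements.append((best, scores))
    return placements, used


if __name__ == "__main__":
    placements, used = allocate_many([0, 0, 0], total=1000, size=100, count=3)
    for n, (best, scores) in enumerate(placements):
        print("new-%d -> node %d, scores %s" % (n, best, scores))
    print("final usage:", used)
```

In this sketch a single allocation on an empty cluster scores the same on
every node, so a "right place" check passes trivially. Only the second and
later allocations, or a check on the score itself, would notice if the
updated node list were not carried forward.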

regards,
iustin
