[Puppet Users] Re: ERB advice

jcbollinger Mon, 18 Apr 2011 09:15:36 -0700


On Apr 15, 9:30 am, Rob McBroom <[email protected]> wrote:
> Details below, but the essential problem is this: I need to take a list of 
> hostnames and assign an integer for each one and ensure that the integer “is 
> non-negative and is no more than three decimal digits in length”. I’d prefer 
> that the each hostname lead to the same integer every time. I’ve never 
> written anything in Ruby before yesterday, so anyone with more experience 
> have any ideas?
>
> Boring details: I’m trying to set up replication between LDAP servers. Each 
> one has to have a unique integer assigned. Details for all of our nodes are 
> stored in LDAP, so when the template is applied to a node, I can have it get 
> a list of servers that it should be syncing with by searching for something 
> like 
> “(&(classification=ldap)(environment=#{environment})(!(status=inactive*)))”.
>
> Things I’ve considered so far:
>
> 1. Just keep a counter as you loop through the list of servers and use that 
> as the ID. The problem is, if I remove a server from the list (which I am 
> planning to do soon), the numbers will be reassigned and depending on when 
> each server checks in with the Puppetmaster, there’s a good chance that two 
> servers could think they have the same ID.
>
> 2. Use regex replacements on the hostname. Our hostnames are all in the 
> format x-xxxx-ldap-01. I could just take the integer part of the name and 
> then add something to it based on the value of environment, but then I’d have 
> to assume the hostname format will never change. Political nonsense is common 
> around here, so I don’t want to rely on that.
>
> What I really need is something like an MD5 sum of each hostname, but the 
> resulting numbers are obviously way too big.


The problem with a hash such as MD5 is that collisions are only
improbable, not impossible, and only because the number possible MD5
sums is large relative to the number of items you're comparing (often
just two).  If you instead used something with results in the range
0-999, then the likelihood of collisions would be much greater.
Furthermore, the more unique results you need, the less likely it is
that such a function can provide them.

Suppose you have a hash function that produces results in the range
you specified, and you use it to compute hashes for a set of
hostnames.  If the hash is good -- the results are indistinguishable
from random numbers -- then the chance of at least one collision is
0.1% for two host names, 1% for five, and 10% for 15.  It goes up
increasingly rapidly from there, and it could be (much) worse if the
hash isn't so good.  MD5 is a pretty good hash, though, and if you
were willing to live with the above risk of collisions then you could
just compute an MD5, convert it to decimal, and take the last three
digits.

What you seem to want, though, is a so-called "perfect hash" with an
output in the range 0-999 decimal.  That is only possible if the
number of allowed inputs (hostnames in your case) is at most 1000,
else there must be at least two names with the same hash.  That's not
so good for you, because 1000 slots doesn't accommodate even the full
space of two-character, case-insensitive, alphanumeric ASCII strings.
(If you stuck to alpha-only, though, then you could cover at least the
one- and two-character names.)  If you have some sort of systematic
naming scheme, though, then you might be okay.

If among your original ideas and my suggestions so far you still don't
see a satisfactory answer, then you're going to have to relax your
requirements.  Perhaps the requirement to rely only on the hostname is
a good target: if you can additionally rely on an external table of
what numbers have previously been assigned to what names, then you can
get mostly what you want.  You look up the code numbers for hostnames
in that table, and assign the next available number whenever a number
is requested for a new hostname.  The table could reside in a
database, but it just as easily might reside in a plain file.

Alternatively, you can reduce -- but not eliminate -- the likelihood
of collisions from a hash function by increasing the number of digits
of the result you use.

Or perhaps you can live with the possibility that nodes will
occasionally be confused about what their correct number is.


John

-- 
You received this message because you are subscribed to the Google Groups 
"Puppet Users" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/puppet-users?hl=en.

[Puppet Users] Re: ERB advice

Reply via email to