[Designs - Feature #6079] Data/Model Separation - Data in (and out of) Modules

tickets Mon, 14 Mar 2011 23:05:31 -0700

Issue #6079 has been updated by Brian Gallew.


I disagree with Luke on data priority: PDL should always override class 
defaults simply from the viewpoint of never wanting to touch the class to 
change the data.

Regarding the PDL, Nigel asked for comment on the datalocation tree.  While I 
hate extra typing as much as the next guy, I think the principle of least 
surprise would indicate that the datalocation would keep the /data directories 
(omitting the files/manifests/templates/tests/etc ones).  For a further 
clarification, though, would the PDLs allow substitutions and functions?

# /ntp/data/ntp.pdl
$server = $my_puppet_master
$conffile = template("ntp/ntp.conf.erb")

While talking with Nigel I brought up the possibility of executable manifests, 
which could (and if they were implemented, *should*) be mirrored on the PDL 
side.  The idea is that there are a lot of attempts to write ugly defines so 
that we can keep our manifests relatively DRY, and that's bad.  Worse, there 
are places where it simply doesn't work (e.g. defines of defines).  I've been 
contemplating adding some code to Puppet such that when it finds a manifest to 
read, it checks the executable bit and, if set, executes the file and parses 
the output.  Likewise, an executable PDL would be executed and the output 
consumed.  Obviously, if done naively it would be hideously fragile since the 
output of the file would absolutely always have to be valid DSL.  Also, it 
would break on e.g. FAT filesystems which lack the executable bit.
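
A minimal sketch of the executable-bit check described above, written in Python rather than in Puppet's own Ruby. The function name `read_manifest` is hypothetical, and the sketch does not validate that the output is legal DSL, which is exactly the fragility noted above.

```python
import os
import subprocess


def read_manifest(path):
    """Return manifest text: run the file if it is executable, else read it."""
    if os.access(path, os.X_OK):
        # Executable: run it and treat standard output as the manifest source.
        # check=True raises CalledProcessError on a non-zero exit status.
        result = subprocess.run([path], capture_output=True, text=True, check=True)
        return result.stdout
    # Not executable: the file itself is the manifest source.
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

An executable PDL could go through the same function. A file that is marked executable but cannot actually be run (for example, no interpreter line) would raise an `OSError` here rather than fall back to being read as text.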

Another thing to think about would be stages (though there's probably another 
ticket somewhere that better deserves this comment).  As things stand, from a 
usage standpoint Puppet feels like a one-pass parser.  As Puppet encounters 
each parseable item (resource, variable, function) it is immediately evaluated. 
 ISTR a ticket about "futures" where evaluation of any given item was delayed 
as long as possible.  This would give templated items a *lot* more utility as 
the catalog and associated variables would be much more complete.  
Alternatively, add an "include_last" function that simply queues up a class for 
inclusion rather than including it immediately.

Let me give you a use case (for pretty much everything I've talked about).  All 
of my computers fall into one of four classes (this is simplified, of course): 
production, testing, development, and services.  All of my computers use the 
AllowGroups clause in sshd_config to restrict access appropriately.  Developers 
can log in to both development and testing systems, change management can log in 
to both testing and production systems, and the sysadmins can, of course, log in 
everywhere (not as root!).  In the ideal situation, you would only have to 
include the sshserver class once, since clearly every system needs it.  
However, if you do that, your sshd_config template will be parsed long before 
you have determined the function of a particular node.  Today you get around 
that in one of two ways.  Either you include the sshserver class in every role 
(except for inherited roles, where you have to include it *only* in the 
children), which is both confusing and the antithesis of DRY, or you have some 
custom fact that runs on the client (e.g. one that parses 
/var/lib/puppet/state/classes.txt) and use that on subsequent runs to generate 
the appropriate sshd_config, thus introducing a one-cycle delay for convergence.
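
To make the "include_last" idea concrete for this use case, here is a toy model in Python. It is not Puppet code: the `Compiler` class, its methods, and the `role` variable are all hypothetical, and the only point is the ordering. A queued class is evaluated after everything else, so it sees the node's role.

```python
class Compiler:
    """Toy model of a one-pass compiler with a deferred-inclusion queue."""

    def __init__(self):
        self.variables = {}
        self.evaluated = []   # (class name, variables visible when evaluated)
        self._deferred = []

    def include(self, name):
        # Immediate evaluation: the class sees only the variables set so far.
        self.evaluated.append((name, dict(self.variables)))

    def include_last(self, name):
        # Queue the class once; it is evaluated when finish() runs.
        if name not in self._deferred:
            self._deferred.append(name)

    def finish(self):
        # Evaluate queued classes after all immediate work is done.
        while self._deferred:
            self.include(self._deferred.pop(0))


compiler = Compiler()
compiler.include_last("sshserver")        # requested first, evaluated last
compiler.variables["role"] = "testing"    # role is only determined later
compiler.include("ntp")
compiler.finish()
```

After `finish()`, the sshserver entry in `compiler.evaluated` carries `role`, even though the class was requested before the role was known.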
----------------------------------------
Feature #6079: Data/Model Separation - Data in (and out of) Modules
https://projects.puppetlabs.com/issues/6079

Author: Nigel Kersten
Status: Investigating
Priority: Normal
Assignee: Nigel Kersten
Category: 
Target version: 
Keywords: 
Branch: 


# Overview #

There are (at least) two problems related to data in Puppet.

 1. The data is mixed up with the model in our manifests.
 2. Puppet modules cannot be reused because site specific settings cannot be 
specified consistently and easily outside of the module contents.

# Glossary #

These terms should not be taken to be authoritative yet, particularly 
define/declare/evaluate. They are here so we can at least be internally 
self-consistent in this document.

 * **Author** - The person who writes Puppet modules.
 * **User** - The person who is responsible for the infrastructure and consumes 
modules written by an Author. A person may be both an Author and a User.
 * **Define** - When you describe what a Class is in Puppet and what parameters 
it accepts, if any.
 * **Declare** - When you instantiate a singleton instance of a Puppet Class 
and specify values for parameters.
 * **Evaluate** - When the agent applies the catalog.
 * **Parameterized Class** - A Puppet Class that has been Defined with a 
non-zero number of parameters.


All snippets of text will have a single header line to indicate their 
filesystem path as follows:
    # /etc/puppet/puppet.conf
    [master]
    ...

# Constraints #

  1. We must attempt to solve both problems the same way where possible.
  2. It must be easy for an Author to separate model and data.
  3. It must be easy for an Author to specify data according to Author-defined 
criteria.
  4. It must be easy for a User to specify site specific data according to 
User-defined criteria.
  5. For both Users and Authors, the implementation must be as transparent as 
possible in the DSLs; the introduction of new keywords, functions and syntax is 
discouraged, but may be necessary.
  6. The solution must work with as many different ways of invoking Puppet as 
possible.
  7. Author and User specification of data must be possible as simple key/value 
pairs in a text file.
  8. Authors and Users should be able to export to the data format reasonably 
easily.
  9. The solution must not preclude User data specification in storage 
mechanisms other than text files.
  10. We must adhere to existing Puppet conventions such as the autoload 
mechanism where possible.

# Proposal #

## Authoring Modules ##

The primary aim is to solve Problem One.

Modules have a new meaningful sub-directory name, "data", and we add a new 
meaningful file extension to puppet, ".pdl" for Puppet Data Library (PDL). 
(Please don't 
[bike-shed](http://en.wikipedia.org/wiki/Parkinson's_Law_of_Triviality) on this 
name or the extension. It is not important yet.) 

In exactly the same way that the autoloader expects class ntp to be found in:
    <modulepath>/ntp/manifests/init.pp
data for class ntp is found in:
    <modulepath>/ntp/data/init.pdl

In exactly the same way that the autoloader expects class ntp::client to be 
found in:
    <modulepath>/ntp/manifests/client.pp
data for class ntp::client is found in:
    <modulepath>/ntp/data/client.pdl
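
The mapping from class name to data file can be sketched as follows. This is an illustration in Python, not the autoloader's code: `data_file_for` is a hypothetical name, and the handling of classes nested more than one level deep is an assumption that mirrors how manifests are laid out.

```python
import posixpath


def data_file_for(class_name, modulepath="<modulepath>"):
    """Map a class name to the PDL file that would hold its data."""
    module, *rest = class_name.split("::")
    # The top-level class maps to init.pdl; nested classes map to nested paths
    # (assumed to follow the same rule as manifests).
    relative = posixpath.join(*rest) if rest else "init"
    return posixpath.join(modulepath, module, "data", relative + ".pdl")
```

With this rule, `data_file_for("ntp")` and `data_file_for("ntp::client")` give the two data paths listed above.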

When you Define a Parameterized Class, if any parameter does not have a default 
specified in the manifest, Puppet consults the PDL.

When a User Declares a Parameterized Class without specifying parameter values, 
Puppet consults the PDL, then any default values the Author has specified if 
the PDL does not provide an answer.

In all following examples, we will be retrieving data for the following classes.

    # <modulepath>/ntp/manifests/init.pp
    class ntp($server, $admin_group) { ... }

    # <modulepath>/ntp/manifests/client.pp
    class ntp::client($iburst) { ... }

## Puppet Data Library Format ##

### Key/Value pairs ###

$server will be set to "time.ntp.org"

    # <modulepath>/ntp/data/init.pdl
    $server = "time.ntp.org"

$iburst will be set to true

    # <modulepath>/ntp/data/client.pdl
    $iburst = true

### Extended Syntax ###

The extended syntax allows for values to be specified based upon other 
variables such as facts about the node.

Each (assignment where conditional) is a single line to avoid the requirement 
of block markers.

If $operatingsystem equals "darwin", set $server to "time.apple.com", else if 
$operatingsystem equals "debian", set $server to "time.ntp.org", else set 
$server to "time2.ntp.org".

    # <modulepath>/ntp/data/init.pdl
    $server = "time.apple.com" where $operatingsystem == "darwin"
    $server = "time.ntp.org" where $operatingsystem == "debian" 
    $server = "time2.ntp.org"

If $hardware_type equals "laptop", set $iburst to false; otherwise, set 
$iburst to true.

    # <modulepath>/ntp/data/client.pdl
    $iburst = false where $hardware_type == "laptop"
    $iburst = true

If $hardware_type equals "laptop" and $domain equals "puppetlabs.lan", set 
$server to "time.puppetlabs.lan"

    # <modulepath>/ntp/data/client.pdl
    $server = "time.puppetlabs.lan" where ($hardware_type == "laptop" and $domain == "puppetlabs.lan")
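
One way to read the extended syntax is "first matching line wins", which is what the if / else if / else descriptions above imply. The Python sketch below follows that reading. It is not a proposed parser: the function names are hypothetical, and it handles only `==` comparisons joined by `and`, with string or boolean values.

```python
import re

# Matches one comparison such as: $operatingsystem == "darwin"
COND = re.compile(r'\$(\w+)\s*==\s*"([^"]*)"')


def parse_value(text):
    """Decode a bare true/false or a double-quoted string."""
    text = text.strip()
    if text in ("true", "false"):
        return text == "true"
    return text.strip('"')


def parse_rule(line):
    """Split one PDL line into (name, value, [(fact, expected), ...])."""
    assignment, _, condition = line.partition(" where ")
    name, _, value = assignment.partition("=")
    return name.strip().lstrip("$"), parse_value(value), COND.findall(condition)


def lookup(name, lines, facts):
    """Return the value of the first rule for `name` whose conditions all hold."""
    for line in lines:
        rule_name, value, conditions = parse_rule(line)
        if rule_name == name and all(facts.get(f) == v for f, v in conditions):
            return value
    return None  # no rule applies; the caller moves on to the next data source
```

A line with no `where` clause has no conditions, so it always matches and acts as the final "else".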

If an Author wishes to have class ntp::client share a parameter value with 
class ntp, they specify this when they Define class ntp::client. 

    # <modulepath>/ntp/manifests/client.pp 
    class ntp::client($version=$ntp::version) { ... }

## Using Modules ##

The primary aim is to solve Problem Two.

We add a new Puppet configuration parameter, "datalocation" that can be set in 
any configuration block (see earlier reference about bike-shedding).

    # /etc/puppet/puppet.conf
    [agent]
    datalocation = /var/lib/puppet/data
    ...

    [master]
    datalocation = /var/lib/puppet/data
    ...

    [development]
    datalocation = /var/lib/puppet/environments/development/data
    ...

    [testing]
    datalocation = /var/lib/puppet/environments/testing/data
    ...

    [production]
    datalocation = /var/lib/puppet/environments/production/data
    ...

Users can create PDLs in these locations that supplant the Author-specified 
PDLs.

The autoloader expects a structure very similar to the one specified in Authoring 
Modules, just rooted at $datalocation rather than $modulepath, and without the 
redundant "data" sub-directory. [Particularly interested in feedback on this 
point]

    # <datalocation>/ntp/init.pdl
    $server = "time.mydomain.com"

    # <datalocation>/ntp/client.pdl
    $iburst = true

Exactly the same extended syntax can be used as specified in Authoring Modules.

In the Puppet configuration file, just as we have "manifest" which refers to 
the entry-point manifest, and that defaults to "$manifestdir/site.pp", we now 
add "datalibrary", which defaults to "$datalocation/site.pdl" (with a final 
reminder about bike-shedding actual names).

The value of $server will still be retrieved from the PDL as specified above, 
but the value of $admin_group will be retrieved from the Site PDL.

    # <datalocation>/ntp/init.pdl
    $server = "time.mydomain.com"

    # <datalocation>/site.pdl
    $admin_group = "corp_dev"

The User can also choose to maintain a single Site PDL with no Module specific 
PDLs if they wish by fully qualifying parameters as follows:
    # <datalocation>/site.pdl
    $ntp::server = "time.mydomain.com"
    $admin_group = "corp_dev"
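
A sketch of how a Site PDL lookup might treat the two key forms, in Python. The proposal does not say which form wins when both are present; trying the fully qualified name first is an assumption, and `site_lookup` is a hypothetical name.

```python
def site_lookup(site_data, class_name, parameter):
    """Look up a class parameter in Site PDL data held as a plain dict."""
    qualified = f"{class_name}::{parameter}"
    if qualified in site_data:
        # Assumed: a fully qualified key is more specific, so it is tried first.
        return site_data[qualified]
    # Fall back to the unqualified key, which applies to any class.
    return site_data.get(parameter)


site = {"ntp::server": "time.mydomain.com", "admin_group": "corp_dev"}
```

Here `site_lookup(site, "ntp", "server")` finds the qualified key, while `site_lookup(site, "ntp", "admin_group")` falls through to the unqualified one.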


## Precedence Order ##

As soon as a value for a parameter is discovered, the lookup process stops for 
that parameter.

Precedence is as follows for our class ntp example:

  1. Manually specified parameter values when Declaring a Parameterized Class  
  2. `<datalocation>/ntp/init.pdl`
  3. `<datalocation>/site.pdl`
  4. `<modulepath>/ntp/data/init.pdl`
  5. Manually specified parameter values when Defining a Parameterized Class.
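
The precedence list can be modelled as an ordered list of sources that is searched until one of them supplies the parameter. The Python below is a sketch with hypothetical names; each source is represented as an already-evaluated dict, and the module-level value is taken from the earlier Key/Value example.

```python
def resolve(parameter, layers):
    """Return (value, source) from the first layer that defines the parameter."""
    for source, values in layers:
        if parameter in values:
            return values[parameter], source   # lookup stops at the first hit
    raise KeyError(parameter)


# Layers for class ntp, in the precedence order listed above.
layers = [
    ("declaration", {}),
    ("<datalocation>/ntp/init.pdl", {"server": "time.mydomain.com"}),
    ("<datalocation>/site.pdl", {"admin_group": "corp_dev"}),
    ("<modulepath>/ntp/data/init.pdl", {"server": "time.ntp.org"}),
    ("class definition defaults", {}),
]
```

Returning the source alongside the value is a design choice of this sketch, not part of the proposal. It is the kind of information a debugging or validation tool would need to show where a value came from.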


## Rich Data Types ##

We have several ways to encode this. We could use Ruby-style or JSON-style. 
JSON-style is going to be easier for people to write tools to populate data 
files. Arrays are the same in both cases, only Hashes differ.

Again, each of these should be assumed to be on a single line.

### Arrays ###

    $packages = [ "one", "two", "three" ] where $operatingsystem == "debian"

### Ruby-style Array of Hashes ###

    $packages = [ { "name" => "puppet", "ensure" => "installed" }, { "name" => "puppetmaster", "ensure" => "installed" }, ]  where $operatingsystem == "debian"

### JSON-style Array of Hashes ###

    $packages = [ { "name":"puppet", "ensure":"installed" }, { "name":"puppetmaster","ensure":"installed" }, ] where $operatingsystem == "debian"

We are not going to support setting individual rich data elements on the left 
hand side like this:
    $packages[0] = "foo" where $operatingsystem == "debian"    # not implementing
    $packages[1] = "bar" where $operatingsystem == "debian"    # not implementing
unless someone can come up with a clean and simple design.
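
To illustrate the tooling argument for the JSON-style encoding, the Python sketch below decodes a value with the standard `json` module. The function name is hypothetical, and the condition is returned as uninterpreted text. Note that a strict JSON parser rejects a trailing comma before the closing bracket, so the sample line here omits it.

```python
import json


def parse_json_rule(line):
    """Split a single-line rule into (name, decoded value, condition text)."""
    assignment, _, condition = line.partition(" where ")
    name, _, raw = assignment.partition("=")
    # json.loads tolerates surrounding whitespace but not trailing commas.
    return name.strip().lstrip("$"), json.loads(raw), condition.strip()


line = (
    '$packages = [ { "name":"puppet", "ensure":"installed" }, '
    '{ "name":"puppetmaster", "ensure":"installed" } ] '
    'where $operatingsystem == "debian"'
)
name, value, condition = parse_json_rule(line)
```

The decoded `value` is an ordinary list of dicts, which is also what a generating tool would serialise with `json.dumps`.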

# Other Notes #

We envisage PDL formats other than this plain text one, such as a Ruby 
PDL that could invoke blocks to query other data sources and/or provide more 
complex conditional logic. This format should remain simple.

Under this proposal, if I reference $ntp::version from anywhere outside class 
ntp itself, it should resolve to the same value as the data lookup process 
does. 


# Potential Problems #

 * How do users easily debug the process?
   * How do they check ahead of time that parameters are getting the values 
they expect on a given node?
   * Will we eventually need a command line tool to validate?
 * Security - do we care if all nodes or some nodes etc can use the data 
contained in modules?
   * Is some form of granular security controls needed and if so how would such 
controls get implemented?


# References #

 * [http://www.devco.net/archives/2009/08/31/complex_data_and_puppet.php](http://www.devco.net/archives/2009/08/31/complex_data_and_puppet.php)
 * [http://www.lab42.it/presentations/puppetmodules/puppetmodules.html](http://www.lab42.it/presentations/puppetmodules/puppetmodules.html)
 * [http://bodepd.com/wordpress/?p=64](http://bodepd.com/wordpress/?p=64)
 * [https://github.com/ohadlevy/puppet-lookup](https://github.com/ohadlevy/puppet-lookup)

