I'm less concerned with the performance ramifications or the schema-design specifics of my proposal than with the fact that I don't want to tie the architecture of the inventory service to storeconfigs, but performance concerns are addressed inline farther down this email.
On Mon, Feb 28, 2011 at 5:43 AM, Ohad Levy <[email protected]> wrote:
> I don't see the benefit in changing the schema, besides adding duplicate
> data into the database.
> granted, there are more than one way to design a schema, but the current
> design is a valid one.
> I'm more than happy to supply the required finder methods for querying the
> facts, as I use them within Foreman.

I agree it's a valid schema, and I don't want to bikeshed over schema details. What concerns me more is that I'm not sure how the storeconfigs tables are currently used in the wild, and at the moment the ActiveRecord models for those tables are essentially the API that people may have built inventory-like services on top of (like Foreman). We want to design the inventory service so that its API is at the indirector level (essentially RESTful) and doesn't care about implementation details of the schema, but it would have to care if we build on top of the tables storeconfigs already uses. I may be making too big a deal of this, since the data model for storing facts is relatively simple and might not need that much architectural flexibility; I'm just hesitant to build on top of existing code when I'm not sure how it's depended upon and where it might introduce issues with running two puppet "services" at the same time.

On Fri, Feb 25, 2011 at 3:27 PM, Markus Roberts <[email protected]> wrote:
> M --
>
>> Currently the storeconfigs tables dealing with facts look something
>> like this (I've removed the columns that are irrelevant to the
>> inventory service):
>>
>>   create_table :hosts do |t|
>>     t.column :name, :string, :null => false
>>   end
>>
>>   create_table :fact_names do |t|
>>     t.column :name, :string, :null => false
>>   end
>>
>>   create_table :fact_values do |t|
>>     t.column :value, :text, :null => false
>>     t.column :fact_name_id, :integer, :null => false
>>     t.column :host_id, :integer, :null => false
>>   end
>>
>> I propose something more like:
>>
>>   create_table :nodes do |t|
>>     t.column :name, :string, :null => false
>>     t.column :timestamp, :datetime
>>   end
>>
>>   create_table :facts do |t|
>>     t.column :name, :string, :null => false
>>     t.column :value, :text, :null => false
>>     t.column :node_id, :integer, :null => false
>>   end
>>
>> It's less normalized than the storeconfigs schema since fact names
>> will be duplicated per node, but easier to understand and work with,
>> and I think better satisfies the types of queries we will be doing,
>> which are of the form "select nodes where fact is equal to value". The
>> more normalized schema would be better for queries of the form "select
>> all values for fact", but I don't think that's something we'll be
>> doing. Correct me if I'm wrong.
>
> I'm not saying that you're wrong, but I'm not seeing how the denormalized
> table would be easier to work with; and if anything I would expect that it
> would be slower to query (and slower to update) in the denormalized form.
> If you index on [fact_id,value] the query should be blazingly fast, and this
> would be easier/cheaper to maintain than a key on [name,value]...or am I
> missing something?
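Just to make sure we're comparing the same thing, I read the two indexes you're describing as something like the following (just a sketch I haven't run; table and column names are taken from the schemas quoted above, and indexing a text value column may need a length limit on some databases):

  # normalized (current storeconfigs) schema
  add_index :fact_values, [:fact_name_id, :value]

  # denormalized schema from my proposal
  add_index :facts, [:name, :value]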
In general I find fewer tables easier to work with, and the ActiveRecord syntax stays a little nicer too (I know you can pretty up the syntax by wrapping some methods around things, but if you don't have to, working with the data is more intuitive in my opinion):

  Fact.where(:name => 'puppetversion', :value => '2.6.5').map {|fact| fact.node.timestamp }

  NormalizedFact.where(:fact_name_id => FactName.find_by_name('puppetversion').id, :value => '2.6.5').map {|fact| fact.node.metadata.where(:name => 'timestamp') }

I don't see that storing the fact names in a separate table would help much. The maintenance of an extra index on fact_name_id, plus the need to do a join or an extra ActiveRecord lookup, would probably make query speed about equivalent to just storing the fact names denormalized. In fact, it should be fairly trivial to test that out. Here's a little repo that sets up both schemas and some test data (100 nodes with 100 facts each):

https://github.com/mmrobins/Denormalization-Permance-Test

The performance of inserting data into the normalized tables is actually a bit slower, I think because you need to do a separate database query to get back the id of the fact name before you can insert the fact:

  == PopulateTestData: migrating ===========
  denormalized  26.950000   1.080000  28.030000 ( 39.207524)
  normalized    34.860000   1.390000  36.250000 ( 49.772006)
  == PopulateTestData: migrated (89.2669s) ===

Searching for facts looks pretty much identical if I do a separate query for the fact name:

  Benchmark.measure { 100.times {|i| Fact.where("name = ? AND value = ?", "fact_#{i}", "fact_value_#{i}").map {|f| f.node.name } } }
  => #<Benchmark::Tms:0x103b27a38 @cutime=0.0, @total=5.46, @label="", @stime=0.33, @real=6.23462510108948, @utime=5.13, @cstime=0.0>

  Benchmark.measure { 100.times {|i| NormalizedFact.where("fact_name_id = ? AND value = ?", FactName.find_by_name("fact_#{i}"), "fact_value_#{i}").map {|f| f.node.name } } }
  => #<Benchmark::Tms:0x10374d228 @cutime=0.0, @total=5.41999999999999, @label="", @stime=0.319999999999999, @real=6.22925186157227, @utime=5.09999999999999, @cstime=0.0>

and, surprisingly to me, the normalized table is a little slower if I do a join and a single query:

  Benchmark.measure { 100.times {|i| NormalizedFact.find(:all, :include => :fact_name, :conditions => ["fact_names.name = ? AND value = ?", "fact_#{i}", "fact_value_#{i}"]).map {|f| f.node.name } } }
  => #<Benchmark::Tms:0x103954828 @cutime=0.0, @total=6.59000000000001, @label="", @stime=0.33, @real=7.36611413955688, @utime=6.26000000000001, @cstime=0.0>

>> Other benefits of the proposed schema include the "metadata" about
>> each fact set being columns on the node table (Nick has also proposed
>> that table be called fact_sets and have a column called node_name)
>> instead of being stored as a fact.
>
> So completely denormalized? Why?

You mean as opposed to having a separate table for metadata? (Note that storeconfigs stores its timestamp metadata as a fact whose name is a serialized symbol in YAML.) Mainly because we currently only have one piece of metadata we store, and as I showed in the example above, the syntax is a bit nicer. If there's more metadata about fact sets that people can think of, I'd be happy to normalize the data. We could do it ahead of time, but why bother? It should be just as easy to do at a later date.

Matt
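P.S. If we do end up wanting more per-fact-set metadata later, I'd imagine the normalized version would look something like this (just a sketch; the table and column names are made up):

  create_table :fact_set_metadata do |t|
    t.column :node_id, :integer, :null => false
    t.column :name,    :string,  :null => false
    t.column :value,   :text
  end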
