Re: [Puppet-dev] Re: A question about numbers and representation

John Bollinger Fri, 05 Sep 2014 13:12:19 -0700

Thanks again, Ken.

On Thursday, September 4, 2014 2:14:25 PM UTC-5, Ken Barber wrote:
>
> > Thanks, Ken.  Could you devote a few words to how PuppetDB chooses which
> > of those alternative columns to use for any particular value, and how it
> > afterward tracks which one has been used?
>
> So PuppetDB, and in particular the way fact-contents stores leaf
> values, makes that decision using a very basic forensic function in
> Clojure:
>
>
> https://github.com/puppetlabs/puppetdb/blob/master/src/com/puppetlabs/puppetdb/facts.clj#L114-L124
>  
>
> We store that ID, which really maps to another lookup table (more for 
> referential integrity purposes than anything). We also use that ID to 
> make the decision as to which column we use: 
> value_string/integer/float/boolean/null etc. 
>
>


I am not very fluent in Clojure, but it looks like that scheme could easily 
be extended to support Bigs.
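
To make the idea concrete, here is a minimal Python sketch (not PuppetDB's 
actual Clojure code) of a type-dispatch function that picks a storage column 
per leaf value.  The value_string/integer/float/boolean/null names come from 
Ken's description above; the value_decimal column, the 64-bit cutoff, and the 
function name are my own assumptions for illustration only.

```python
from decimal import Decimal

# Assumed range of the integer column (64-bit signed).
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def value_column(value):
    """Return the name of the column that would hold `value`."""
    if value is None:
        return "value_null"
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "value_boolean"
    if isinstance(value, int):
        # Integers that fit in 64 bits stay in the integer column; larger
        # ones overflow into the hypothetical Big (decimal) column.
        if INT64_MIN <= value <= INT64_MAX:
            return "value_integer"
        return "value_decimal"
    if isinstance(value, float):
        return "value_float"
    if isinstance(value, Decimal):
        return "value_decimal"
    if isinstance(value, str):
        return "value_string"
    raise TypeError(f"unsupported leaf value type: {type(value).__name__}")
```

With this shape, supporting Bigs is one extra branch plus one extra column, 
which is roughly what I mean by "easily extended".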

 

> > I'm also curious about whether index efficiency in PostgreSQL (as an
> > example) takes a significant hit just from an index being defined on a
> > Numeric/Decimal column or whether the impact depends strongly on the
> > number of non-NULL values in that column.
>
> It takes a hit because it requires more storage, and I believe in 
> theory the index optimisation around decimals is different also (but I 
> don't have the real numbers on that) ... integers are more commonly 
> optimised in their code base (pg's that is) because of their 
> wide-spread use in id columns. Again the same is true for smallints 
> versus bigints. This is of course conjecture to a degree, without perf 
> numbers to back it, but I believe I'm probably correct. Of course 
> when it comes time to analyse this closer we'll prove it with real 
> perf tests as we usually do :-). 
>
>

Absolutely nothing beats *bona fide* tests :-).

 

> > Additionally, I'm curious about how (or whether) the alternative column
> > approach interacts with queries.  Do selection predicates against Any
> > type values typically need to consider multiple (or all) of the value
> > columns?
>
> It's kind of interesting, and largely pivots on operator. For 
> structured facts and the <, >, <=, >= operators ... we are forced to 
> interrogate both the integer and float columns (an OR clause in SQL 
> basically), because a user would presume that's how it worked.



I suspected as much.

 

> In a way 
> this is a coercion. If we introduced a decimal, we would have to do 
> the same again, especially if it was an overflow where its other 
> related numbers are still integers. Even while we do this, the decimal 
> column should in theory (and this needs to be backed with perf numbers) 
> be sparser than the integer column, and therefore quicker overall to 
> traverse, so we could have our cake and eat it too with just 
> a mild perf hit. If they were all decimals, then we would be locking 
> ourselves into the performance of decimal for all numbers. 
>
>

I have been supposing that both in Ruby and in PuppetDB, the numeric 
representation would be the smallest / most efficient one that could 
accommodate the value without loss of fidelity.  That would mean any 
additional column supporting Big values would likely be very sparsely 
populated indeed, and also that PuppetDB could conceivably be clever enough 
to avoid checking that column at all for some queries.  For example, a 
query with criteria (val >= 1 and val < 100) excludes all values that 
would need to be represented in a Big format.  Dunno whether that would 
help much.
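
A rough Python sketch of that pruning idea, purely as an illustration: it 
builds the OR clause Ken describes over the integer and float columns, and 
adds a Big column only when the requested range can reach outside the 64-bit 
integer range.  The value_decimal column, the function name, and the 
assumption that the Big column holds only integers too large for 64 bits are 
all mine, not anything PuppetDB does today.

```python
# Assumed range of the integer column (64-bit signed).
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def range_predicate(low, high):
    """Build a WHERE fragment and parameters for low <= val < high."""
    # The integer and float columns are always consulted (the OR clause).
    columns = ["value_integer", "value_float"]
    # Under the stated assumption, the Big column can only match when the
    # half-open range [low, high) extends beyond what 64 bits can hold.
    if low < INT64_MIN or high > INT64_MAX + 1:
        columns.append("value_decimal")
    clause = " OR ".join(f"({c} >= %s AND {c} < %s)" for c in columns)
    params = [low, high] * len(columns)
    return clause, params


clause, params = range_predicate(1, 100)
print(clause)
print(params)
```

For the (val >= 1 and val < 100) case the Big column never appears in the 
generated clause; a range such as (1, 2**70) would pull it in.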

 

> Another example ... the ~ operator only works on strings of course, 
> and in fact benefits from the new trgm indexes we've started to 
> introduce in 9.3. This wasn't as simple before: the normal index for a 
> text column actually has a maximum size limit, a fact not 
> many devs realise. It also isn't used for regexp queries :-). But I 
> digress ... 
>
> So in short, we hide this internal float/integer comparison problem 
> from the user. In my mind, they are both "numbers" and we try to 
> expose it that way, but internally they are treated with the most 
> optimal storage we can provide. 
>
>

That's pretty much what I hoped and expected.  I am supposing that the same 
philosophy could be extended to cover Big values without too much 
difficulty, but I guess the question of how costly that might be would need 
to be determined by testing.


John
