On 09/07/2014 07:09 PM, Emre Hasegeli wrote:

>>> I updated the patch to cover semi and anti joins with eqjoinsel_semi().
>>> I think it is better than returning a constant.
>>
>> What you did there is utterly unacceptable from a modularity standpoint;
>> and considering that the values will be nowhere near right, the argument
>> that "it's better than returning a constant" seems pretty weak. I think
>> you should just take that out again.
>
> I will try to come up with a better, data type specific implementation
> in a week.
>
> New version with semi join estimation function attached.

Thanks. Overall, my impression of this patch is that it works very well.

But damned if I understood *how* it works :-). There's a lot of
statistics involved, and it's not easy to see why something is
multiplied by something else. I'm adding comments as I read through it.

I've gotten to the inet_semi_join_selec function:

/*
 * Inet semi join selectivity estimation.
 */
static Selectivity
inet_semi_join_selec(bool mcv2_exists, Datum *mcv2_values, int mcv2_nvalues,
                     bool his2_exists, Datum *his2_values, int his2_nvalues,
                     double his2_weight, Datum *constvalue,
                     FmgrInfo *proc, short opr_order)
{
    if (mcv2_exists)
    {
        int         i;

        for (i = 0; i < mcv2_nvalues; i++)
        {
            if (DatumGetBool(FunctionCall2Coll(proc, DEFAULT_COLLATION_OID,
                                               *constvalue, mcv2_values[i])))
                return 1.0;
        }
    }

    /* Do not bother if histogram weight is smaller than 0.1. */
    if (his2_exists && his2_weight > 0.1)
    {
        Selectivity his_selec;

        his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,
                                             constvalue, opr_order);
        if (his_selec > 0)
            return Min(1.0, his2_weight * his_selec);
    }

    return 0.0;
}

This desperately needs a comment at the top of the function explaining
what it does. Let me try to explain what I think it does:

This function calculates the probability that there is at least one row
in table B which satisfies the "constant op column" qual. The constant
is passed as an argument, and for table B, the MCV list and histogram
are provided. his2_weight is the total number of rows in B that are
covered by the histogram. For example, if the table has 1000 rows, and
10% of the rows in the table are in the MCV, and another 10% are NULLs,
his2_weight would be 800.

First, we check if the constant matches any of the most common values.
If it does, we return 1.0, because then there is surely a match.

Next, we use the histogram to estimate the number of rows in the table
that match the qual. If it amounts to more than 1 row, we return 1.0.
If it's between 0.0 and 1.0 rows, we return that number as the
probability.

Now, I think that last step is wrong. Firstly, the "do not bother if
histogram weight is smaller than 0.1" rule seems bogus. his2_weight is
the total number of rows represented by the histogram, so surely it
can't be less than 1. It can't really be less than the statistics
target. Unless maybe the histogram was collected when the table was
large, but it has since shrunk to contain only a few rows - but that
seems like a very bizarre corner case. At the least it needs more
comments explaining what the test is all about, but I think we should
just always use the histogram (if it's available).

Secondly, if we estimate that there is on average 1.0 matching row in
the table, it does not follow that the probability that at least one
row matches is 1.0. Assuming a Gaussian distribution with mean 1.0, the
probability that at least one row matches is 0.5. Assuming a Gaussian
distribution here isn't quite right - I guess a Poisson distribution
would be more accurate - but it sure doesn't seem right as it is.

The error isn't very big, and perhaps you don't run into it very often,
so I'm not sure what the best way to fix it would be. My statistics
skills are a bit rusty, but I think the appropriate way would be to
apply the Poisson distribution, with the estimated number of matched
rows as the mean. The probability of at least one match would then be
one minus the Poisson probability of zero matches. It sounds like
overkill if this case occurs only rarely. But then again, perhaps it's
not all that rare.

That said, I can't immediately find a test case where that error would
matter. I tried this:

create table inettbl1 (a inet);
insert into inettbl1 select '10.0.0.' || (g % 255) from
  generate_series(1, 1000) g;
analyze inettbl1;
explain analyze select count(*) from inettbl1
  where a >>= ANY (SELECT a from inettbl1);

The estimate for that is pretty accurate, 833 rows estimated vs. 1000
actual, with the current patch. I'm afraid that if we fixed
inet_semi_join_selec the way I suggest, the estimate would be smaller,
i.e. more wrong. Is there something else in the estimates that
accidentally compensates for this currently?

Thoughts?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers