On 12/12/13 08:14, Gavin Flower wrote:

On 12/12/13 07:22, Gavin Flower wrote:On 12/12/13 06:22, Tom Lane wrote:I wrote:Hm. You can only take N rows from a block if there actually are atleastN rows in the block. So the sampling rule I suppose you are using is "select up to N rows from each sampled block" --- and that is going to favor the contents of blocks containing narrower-than-average rows.Oh, no, wait: that's backwards. (I plead insufficient caffeine.) Actually, this sampling rule discriminates *against* blocks with narrower rows. You previously argued, correctly I think, that sampling all rows on each page introduces no new bias because row width cancels out across all sampled pages. However, if you just include up to N rows from each page, then rows on pages with more than N rows have a lower probability of being selected, but there's no such bias against wider rows. This explains why you saw smaller values of "i" being undersampled.Had you run the test series all the way up to the max number of tuples per block, which is probably a couple hundred in this test, I think you'd have seen the bias go away again. But the takeaway point is that we have to sample all tuples per page, not just a limited number of them, if we want to change it like this. regards, tom laneSurely we want to sample a 'constant fraction' (obviously, inpractice you have to sample an integral number of rows in a page!) ofrows per page? The simplest way, as Tom suggests, is to use all therows in a page.However, if you wanted the same number of rows from a greater numberof pages, you could (for example) select a quarter of the rows fromeach page. In which case, when this is a fractional number: take theintegral number of rows, plus on extra row with a probability equalto the fraction (here 0.25).Either way, if it is determined that you need N rows, then keepselecting pages at random (but never use the same page more thanonce) until you have at least N rows.Cheers, GavinYes the fraction/probability, could actually be one of: 0.25, 0.50, 0.75.But there is a bias introduced by the arithmetic average size of therows in a page. This results in block sampling favouring large rows,as they are in a larger proportion of pages.For example, assume 1000 rows of 200 bytes and 1000 rows of 20 bytes,using 400 byte pages. In the pathologically worst case, assumingmaximum packing density and no page has both types: the large rowswould occupy 500 pages and the smaller rows 50 pages. So if oneselected 11 pages at random, you get about 10 pages of large rows andabout one for small rows! In practice, it would be much less extreme- for a start, not all blocks will be fully packed, most blocks wouldhave both types of rows, and there is usually greater variation in rowsize - but still a bias towards sampling larger rows. So somehow,this bias needs to be counteracted.Cheers, Gavin

`Actually, I just thought of a possible way to overcome the bias towards`

`large rows.`

1. Calculate (a rough estimate may be sufficient, if not too 'rough') the size of the smallest row. 2. Select a page at random (never selecting the same page twice) 3. Then select rows at random within the page (never selecting the same row twice). For each row selected, accept it with the probability equal to (size of smallest row)/(size of selected row). I think you find that will almost completely offset the bias towards larger rows! 4. If you do not have sufficient rows, and you still have pages not yet selected, goto 2

`Note that it will be normal for for some pages not to have any rows`

`selected, especially for large tables!`

Cheers, Gavin P.S.

`I really need to stop thinking about this problem, and get on with my`

`assigned project!!!`