Re: [HACKERS] Reviewing new index types (was Re: [PATCHES] Updated bitmap indexpatch)

Bruce Momjian Wed, 26 Sep 2007 01:37:22 -0700

This has been saved for the 8.4 release:

        http://momjian.postgresql.org/cgi-bin/pgpatches_hold


---------------------------------------------------------------------------

Simon Riggs wrote:
> On Sat, 2007-07-21 at 12:20 +0100, Simon Riggs wrote:
> 
> > I'd like to help where I can if nobody else is currently doing this. I
> > would focus initially on some analysis of the various use cases to give
> > a better view on what we would need B-tree, clustered indexes and bitmap
> > indexes to do for us.
> 
> I've done some further analysis of bitmap indexes in preparation for a
> comparison with clustered indexes (GIT), to help understand the use
> cases for each.
> 
> Overall, my conclusion is that BMI and GIT have separate use cases,
> almost opposite use cases or at least orthogonal ones. I would
> eventually like both. BMI optimises for high numbers of rows per value,
> whilst GIT optimises for clustering of values. BMI is not useful at all
> for PKs, whilst GIT is specifically designed to handle them. Both handle
> INSERTs well, though GIT handles growing numbers of values easily, BMI
> prefers to keep the distribution more constant. GIT needs HOT to
> continue to operate effectively for long periods, whereas BMI doesn't
> seem to handle UPDATEs well at all (but more testing required on that
> one).
> 
> ---
> 
> Neither the latest bitmap index nor the latest GIT patch applied
> cleanly. The bitmap patch was close, but GIT needs an update yet to
> integrate Alexey's recent work.
> 
> My test case was a table with 10 million rows, with columns with varying
> numbers of unique values. So Ndistinct = 100 means 100,000 rows per
> value.
> 
> BITMAP INDEXES
> 
> Ndistinct     Best time       Size in blocks
> 1             10.6s           100
> 10            10.4s           102
> 100           11.7s           2002
> 1000          15.1s           6006
> 10000         19.8s           10046
> 100000                82.1s           100442
> 1000000               -               >450000
> 
> Size exactly equivalent for both Integer and Text (same values). Build
> time was similar also.
> 
> The test for 1 million distinct values didn't return after over 4 CPU
> minutes expended with the disk going crazy. After a number of minutes I
> decided to cancel the index build. Multiple cancels didn't stop the
> build, so after some more time I decided to kill it, which then crashed
> the server. Automatic restart crashed as well with a "could not find
> transaction id 0" error. Clearly some WAL-weirdness to investigate...
> 
> Overall, I'd have to say that's quite enough for me to say bitmap is not
> quite ready yet without clear health warnings. I had hopes...
> 
> B-TREE INDEXES (Integers) 
> 
> Rows/value    Best time       Size in blocks
> 10000000      49s             21899
> 1000000               49s             21899
> 100000                49s             21899
> 10000         47s             21899
> 1000          43s             21899
> 100           38s             21899
> 10            38s             21899
> 1             33s             21899
> 
> Build time for Integers shown. Build time for Text ~x5-6 times as long.
> 
> Testing against equivalent b-tree builds, the fastest b-tree build I
> could get was 33s on a unique integer index. So BMI build time is
> certainly optimised for low numbers of distinct values, but doesn't have
> any optimisation for when the BMI is built on a poor candidate column.
> GIT does degrade down to a normal b-tree when clustering isn't
> sufficient to give reduction in index size.
> 
> The cross-over point was between 10^4 and 10^5 distinct values for both
> size and build time; on that test around 100-1000 rows per value. So
> BMIs are probably still useful with varying number of rows per value,
> but overall high Ndistinct proves inefficient in both build time and
> space allocation. This isn't such a surprise since we know that b-tree
> build uses a sort-based plan whereas BMI uses a hash based plan; neither
> will win all of the time, we know that from the executor.
> 
> GIT works well even with unique indexes, since each grouped tuple covers
> a range of values. I'll re-run the tests when I can to get timings. GIT
> can compress typically down to 1-5% with clustered data, not quite as
> good as bitmap's 0.5% best.
> 
> GIT's design was to have an index that was tuned for clustered data, yet
> degrades cleanly to a standard b-tree when conditions are not right.
> This makes me think that a hybrid b-tree should be possible, even
> desirable. When the data is clustered, use the grouping technique to
> reduce he number of tuples stored and when the data is highly non-unique
> use the bitmap technique to reduce numbers of tuples. Using both
> techniques in the same index would offer even wider flexibility, since
> we'd be able to cater for real-world data more easily. Both GIT and BMI
> use bitmaps, just in different ways.
> 
> -- 
>   Simon Riggs
>   EnterpriseDB  http://www.enterprisedb.com
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match

Re: [HACKERS] Reviewing new index types (was Re: [PATCHES] Updated bitmap indexpatch)

Reply via email to