Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table

Michael Paquier Mon, 06 Jun 2016 22:36:06 -0700

On Fri, Jun 3, 2016 at 11:25 PM, Tom Lane <[email protected]> wrote:
> Michael Paquier <[email protected]> writes:
>> Actually, the docs could be more polished.
>
> I think the docs could stand to be rewritten from scratch ;-).  But
> upthread there was an offer to work on them if we made the code behavior
> saner.  I've done the latter part, I don't want to do the former.


I have finally given a shot at improving the docs with the attached.
Comments are welcome.
-- 
Michael

diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml
index 8667763..c1e204f 100644
--- a/doc/src/sgml/bloom.sgml
+++ b/doc/src/sgml/bloom.sgml
@@ -8,43 +8,38 @@
  </indexterm>
 
  <para>
-  <literal>bloom</> is a module that implements an index access method.  It comes
-  as an example of custom access methods and generic WAL record usage.  But it
-  is also useful in itself.
+  <literal>bloom</> provides an index access method usable as a
+  <ulink url="http://en.wikipedia.org/wiki/Bloom_filter";>Bloom filter</ulink>.
  </para>
 
- <sect2>
-  <title>Introduction</title>
-
-  <para>
-   The implementation of a
-   <ulink url="http://en.wikipedia.org/wiki/Bloom_filter";>Bloom filter</ulink>
-   allows fast exclusion of non-candidate tuples via signatures.
-   Since a signature is a lossy representation of all indexed attributes,
-   search results must be rechecked using heap information.
-   The user can specify signature length in bits (default 80, maximum 4096)
-   and the number of bits generated for each index column (default 2,
-   maximum 4095).
-  </para>
+ <para>
+  A bloom filter is a space-efficient data structure that is used to test
+  whether an element is a member of a set.  In the case of an index access
+  method, it allows fast exclusion of non-candidate tuples via signatures
+  whose size is calculated in bits and is controllable at index creation.
+ </para>
 
-  <para>
-   This index is useful if a table has many attributes and queries include
-   arbitrary combinations of them.  A traditional <literal>btree</> index is
-   faster than a bloom index, but it can require many indexes to support all
-   possible queries where one needs only a single bloom index.  A Bloom index
-   supports only equality comparison.  Since it's a signature file, and not a
-   tree, it always must be read fully, but sequentially, so that index search
-   performance is constant and doesn't depend on a query.
-  </para>
- </sect2>
+ <para>
+  A signature is a lossy representation of all indexed attributes and
+  so it can report false positives (it is reported that an element exists in
+  the set, however it does not), so search results must be rechecked using
+  heap information.
+ </para>
+ 
+ <para>
+  This type of index is useful if a table has many attributes and queries
+  include arbitrary combinations of them.  A traditional <literal>btree</>
+  index is faster than a bloom index, but it can require many indexes to
+  support all possible queries where one needs only a single bloom index
+  (default 80, maximum 4096).
+ </para>
 
  <sect2>
   <title>Parameters</title>
 
   <para>
-   <literal>bloom</> indexes accept the following parameters in the
-   <literal>WITH</>
-   clause.
+   A <literal>bloom</> index accepts the following parameters in its
+   <literal>WITH</> clause.
   </para>
 
    <variablelist>
@@ -52,7 +47,8 @@
     <term><literal>length</></term>
     <listitem>
      <para>
-      Length of signature in bits
+      Length of signature in bits. Default is <literal>80</> and maximum is
+      <literal>4096</>.
      </para>
     </listitem>
    </varlistentry>
@@ -62,7 +58,9 @@
     <term><literal>col1 &mdash; col32</></term>
     <listitem>
      <para>
-      Number of bits generated for each index column
+      Number of bits generated for each index column. Each parameter refers
+      to the number of the column for the relation on which an index is
+      defined.
      </para>
     </listitem>
    </varlistentry>
@@ -82,91 +80,88 @@ CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)
 </programlisting>
 
   <para>
-   Here, we created a bloom index with a signature length of 80 bits,
-   and attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits.
+   Here, a bloom index is created with a signature length of 80 bits, and
+   attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits.
   </para>
 
   <para>
-   Here is a fuller example of index definition and usage:
+   Here is a more complete example of index definition and usage, a bloom
+   index  representing first the advantage to be more space-efficient
+   than an equivalent btree index:
   </para>
 
 <programlisting>
-CREATE TABLE tbloom AS
-SELECT
-    random()::int as i1,
-    random()::int as i2,
-    random()::int as i3,
-    random()::int as i4,
-    random()::int as i5,
-    random()::int as i6,
-    random()::int as i7,
-    random()::int as i8,
-    random()::int as i9,
-    random()::int as i10,
-    random()::int as i11,
-    random()::int as i12,
-    random()::int as i13
-FROM
-    generate_series(1,1000);
-CREATE INDEX bloomidx ON tbloom USING
-             bloom (i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12);
-SELECT pg_relation_size('bloomidx');
-CREATE index btree_idx ON tbloom(i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12);
-SELECT pg_relation_size('btree_idx');
-</programlisting>
-
-<programlisting>
-=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15;
-                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------
- Bitmap Heap Scan on tbloom  (cost=1.50..5.52 rows=1 width=52) (actual time=0.057..0.057 rows=0 loops=1)
-   Recheck Cond: ((i2 = 20) AND (i10 = 15))
-   ->  Bitmap Index Scan on bloomidx  (cost=0.00..1.50 rows=1 width=0) (actual time=0.041..0.041 rows=9 loops=1)
-         Index Cond: ((i2 = 20) AND (i10 = 15))
- Total runtime: 0.081 ms
-(5 rows)
+=# CREATE TABLE tbloom AS
+   SELECT
+     (random() * 1000000)::int as i1,
+     (random() * 1000000)::int as i2,
+     (random() * 1000000)::int as i3,
+     (random() * 1000000)::int as i4,
+     (random() * 1000000)::int as i5,
+     (random() * 1000000)::int as i6
+   FROM
+  generate_series(1,10000000);
+SELECT 10000000
+=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
+CREATE INDEX
+=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
+ pg_size_pretty
+----------------
+ 153 MB
+(1 row)
+=# CREATE index btree_idx ON tbloom(i1,i2,i3,i4,i5,i6);
+CREATE INDEX
+=# SELECT pg_size_pretty(pg_relation_size('btree_idx'));
+ pg_size_pretty
+----------------
+ 387 MB
+(1 row)
 </programlisting>
 
   <para>
-   Seqscan is slow.
+   A sequential scan takes a long time when concatenating multiple quals
+   with <literal>AND</literal>, in which case btree indexes are usually
+   selected by the planner:
   </para>
-
 <programlisting>
-=# SET enable_bitmapscan = off;
-=# SET enable_indexscan = off;
-=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15;
-                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
- Seq Scan on tbloom  (cost=0.00..25.00 rows=1 width=52) (actual time=0.162..0.162 rows=0 loops=1)
-   Filter: ((i2 = 20) AND (i10 = 15))
- Total runtime: 0.181 ms
-(3 rows)
+ =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
+                                                           QUERY PLAN
+---------------------------------------------------------------------------------------------------------------------------------
+ Index Only Scan using btree_idx on tbloom  (cost=0.56..298315.96 rows=1 width=24) (actual time=669.284..669.284 rows=0 loops=1)
+   Index Cond: ((i2 = 898732) AND (i5 = 123451))
+   Heap Fetches: 0
+ Planning time: 0.159 ms
+ Execution time: 669.330 ms
+ (5 rows)
 </programlisting>
 
- <para>
-  A btree index will be not used for this query.
- </para>
+  <para>
+   Bloom is better than btree in handling this kind of qual accumulation:
+  </para>
 
 <programlisting>
-=# DROP INDEX bloomidx;
-=# CREATE INDEX btree_idx ON tbloom(i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12);
-=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15;
-                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
- Seq Scan on tbloom (cost=0.00..25.00 rows=1 width=52) (actual time=0.210..0.210 rows=0 loops=1)
-   Filter: ((i2 = 20) AND (i10 = 15))
- Total runtime: 0.250 ms
-(3 rows)
+ =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
+                                                         QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------
+ Bitmap Heap Scan on tbloom  (cost=178436.49..178440.51 rows=1 width=24) (actual time=125.279..125.279 rows=0 loops=1)
+   Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
+   Rows Removed by Index Recheck: 2391
+   Heap Blocks: exact=2340
+   ->  Bitmap Index Scan on bloomidx  (cost=0.00..178436.49 rows=1 width=0) (actual time=109.270..109.270 rows=2391 loops=1)
+         Index Cond: ((i2 = 898732) AND (i5 = 123451))
+ Planning time: 0.490 ms
+ Execution time: 125.377 ms
+(8 rows)
 </programlisting>
  </sect2>
 
  <sect2>
-  <title>Opclass interface</title>
+  <title>Operator Class Interface</title>
 
   <para>
-   The Bloom opclass interface is simple.  It requires 1 supporting function:
-   a hash function for the indexing datatype.  It provides 1 search operator:
-   the equality operator.  The example below shows <literal>opclass</>
+   The Bloom opclass interface is simple as it requires one hash
+   function for the indexing datatype and it needs one equality operator
+   as search operator. The example below shows <literal>opclass</>
    definition for <literal>text</> datatype.
   </para>
 
@@ -179,22 +174,21 @@ DEFAULT FOR TYPE text USING bloom AS
  </sect2>
 
  <sect2>
-  <title>Limitation</title>
+  <title>Limitations</title>
   <para>
-
    <itemizedlist>
     <listitem>
      <para>
-      For now, only opclasses for <literal>int4</>, <literal>text</> come
-      with the module.  However, users may define more of them.
+      Only operator classes for <literal>int4</> and <literal>text</> are
+      implemented with the module.
      </para>
     </listitem>
 
     <listitem>
      <para>
-      Only the <literal>=</literal> operator is supported for search at the
-      moment.  But it's possible to add support for arrays with contains and
-      intersection operations in the future.
+      Only the <literal>=</literal> operator is supported for search.  But
+      it is possible to add support for arrays with union and intersection
+      operations in the future.
      </para>
     </listitem>
    </itemizedlist>

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [BUGS] BUG #14155: bloom index error with unlogged table

Reply via email to