On Fri, Jun 3, 2016 at 11:25 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paqu...@gmail.com> writes: >> Actually, the docs could be more polished. > > I think the docs could stand to be rewritten from scratch ;-). But > upthread there was an offer to work on them if we made the code behavior > saner. I've done the latter part, I don't want to do the former.
I have finally given a shot at improving the docs with the attached. Comments are welcome. -- Michael
diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml index 8667763..c1e204f 100644 --- a/doc/src/sgml/bloom.sgml +++ b/doc/src/sgml/bloom.sgml @@ -8,43 +8,38 @@ </indexterm> <para> - <literal>bloom</> is a module that implements an index access method. It comes - as an example of custom access methods and generic WAL record usage. But it - is also useful in itself. + <literal>bloom</> provides an index access method usable as a + <ulink url="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</ulink>. </para> - <sect2> - <title>Introduction</title> - - <para> - The implementation of a - <ulink url="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</ulink> - allows fast exclusion of non-candidate tuples via signatures. - Since a signature is a lossy representation of all indexed attributes, - search results must be rechecked using heap information. - The user can specify signature length in bits (default 80, maximum 4096) - and the number of bits generated for each index column (default 2, - maximum 4095). - </para> + <para> + A bloom filter is a space-efficient data structure that is used to test + whether an element is a member of a set. In the case of an index access + method, it allows fast exclusion of non-candidate tuples via signatures + whose size is calculated in bits and is controllable at index creation. + </para> - <para> - This index is useful if a table has many attributes and queries include - arbitrary combinations of them. A traditional <literal>btree</> index is - faster than a bloom index, but it can require many indexes to support all - possible queries where one needs only a single bloom index. A Bloom index - supports only equality comparison. Since it's a signature file, and not a - tree, it always must be read fully, but sequentially, so that index search - performance is constant and doesn't depend on a query. - </para> - </sect2> + <para> + A signature is a lossy representation of all indexed attributes and + so it can report false positives (it is reported that an element exists in + the set, however it does not), so search results must be rechecked using + heap information. + </para> + + <para> + This type of index is useful if a table has many attributes and queries + include arbitrary combinations of them. A traditional <literal>btree</> + index is faster than a bloom index, but it can require many indexes to + support all possible queries where one needs only a single bloom index + (default 80, maximum 4096). + </para> <sect2> <title>Parameters</title> <para> - <literal>bloom</> indexes accept the following parameters in the - <literal>WITH</> - clause. + A <literal>bloom</> index accepts the following parameters in its + <literal>WITH</> clause. </para> <variablelist> @@ -52,7 +47,8 @@ <term><literal>length</></term> <listitem> <para> - Length of signature in bits + Length of signature in bits. Default is <literal>80</> and maximum is + <literal>4096</>. </para> </listitem> </varlistentry> @@ -62,7 +58,9 @@ <term><literal>col1 — col32</></term> <listitem> <para> - Number of bits generated for each index column + Number of bits generated for each index column. Each parameter refers + to the number of the column for the relation on which an index is + defined. </para> </listitem> </varlistentry> @@ -82,91 +80,88 @@ CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3) </programlisting> <para> - Here, we created a bloom index with a signature length of 80 bits, - and attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. + Here, a bloom index is created with a signature length of 80 bits, and + attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. </para> <para> - Here is a fuller example of index definition and usage: + Here is a more complete example of index definition and usage, a bloom + index representing first the advantage to be more space-efficient + than an equivalent btree index: </para> <programlisting> -CREATE TABLE tbloom AS -SELECT - random()::int as i1, - random()::int as i2, - random()::int as i3, - random()::int as i4, - random()::int as i5, - random()::int as i6, - random()::int as i7, - random()::int as i8, - random()::int as i9, - random()::int as i10, - random()::int as i11, - random()::int as i12, - random()::int as i13 -FROM - generate_series(1,1000); -CREATE INDEX bloomidx ON tbloom USING - bloom (i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12); -SELECT pg_relation_size('bloomidx'); -CREATE index btree_idx ON tbloom(i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12); -SELECT pg_relation_size('btree_idx'); -</programlisting> - -<programlisting> -=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15; - QUERY PLAN ------------------------------------------------------------------------------------------------------------------ - Bitmap Heap Scan on tbloom (cost=1.50..5.52 rows=1 width=52) (actual time=0.057..0.057 rows=0 loops=1) - Recheck Cond: ((i2 = 20) AND (i10 = 15)) - -> Bitmap Index Scan on bloomidx (cost=0.00..1.50 rows=1 width=0) (actual time=0.041..0.041 rows=9 loops=1) - Index Cond: ((i2 = 20) AND (i10 = 15)) - Total runtime: 0.081 ms -(5 rows) +=# CREATE TABLE tbloom AS + SELECT + (random() * 1000000)::int as i1, + (random() * 1000000)::int as i2, + (random() * 1000000)::int as i3, + (random() * 1000000)::int as i4, + (random() * 1000000)::int as i5, + (random() * 1000000)::int as i6 + FROM + generate_series(1,10000000); +SELECT 10000000 +=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6); +CREATE INDEX +=# SELECT pg_size_pretty(pg_relation_size('bloomidx')); + pg_size_pretty +---------------- + 153 MB +(1 row) +=# CREATE index btree_idx ON tbloom(i1,i2,i3,i4,i5,i6); +CREATE INDEX +=# SELECT pg_size_pretty(pg_relation_size('btree_idx')); + pg_size_pretty +---------------- + 387 MB +(1 row) </programlisting> <para> - Seqscan is slow. + A sequential scan takes a long time when concatenating multiple quals + with <literal>AND</literal>, in which case btree indexes are usually + selected by the planner: </para> - <programlisting> -=# SET enable_bitmapscan = off; -=# SET enable_indexscan = off; -=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15; - QUERY PLAN --------------------------------------------------------------------------------------------------- - Seq Scan on tbloom (cost=0.00..25.00 rows=1 width=52) (actual time=0.162..0.162 rows=0 loops=1) - Filter: ((i2 = 20) AND (i10 = 15)) - Total runtime: 0.181 ms -(3 rows) + =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451; + QUERY PLAN +--------------------------------------------------------------------------------------------------------------------------------- + Index Only Scan using btree_idx on tbloom (cost=0.56..298315.96 rows=1 width=24) (actual time=669.284..669.284 rows=0 loops=1) + Index Cond: ((i2 = 898732) AND (i5 = 123451)) + Heap Fetches: 0 + Planning time: 0.159 ms + Execution time: 669.330 ms + (5 rows) </programlisting> - <para> - A btree index will be not used for this query. - </para> + <para> + Bloom is better than btree in handling this kind of qual accumulation: + </para> <programlisting> -=# DROP INDEX bloomidx; -=# CREATE INDEX btree_idx ON tbloom(i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12); -=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 20 AND i10 = 15; - QUERY PLAN --------------------------------------------------------------------------------------------------- - Seq Scan on tbloom (cost=0.00..25.00 rows=1 width=52) (actual time=0.210..0.210 rows=0 loops=1) - Filter: ((i2 = 20) AND (i10 = 15)) - Total runtime: 0.250 ms -(3 rows) + =# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------------------- + Bitmap Heap Scan on tbloom (cost=178436.49..178440.51 rows=1 width=24) (actual time=125.279..125.279 rows=0 loops=1) + Recheck Cond: ((i2 = 898732) AND (i5 = 123451)) + Rows Removed by Index Recheck: 2391 + Heap Blocks: exact=2340 + -> Bitmap Index Scan on bloomidx (cost=0.00..178436.49 rows=1 width=0) (actual time=109.270..109.270 rows=2391 loops=1) + Index Cond: ((i2 = 898732) AND (i5 = 123451)) + Planning time: 0.490 ms + Execution time: 125.377 ms +(8 rows) </programlisting> </sect2> <sect2> - <title>Opclass interface</title> + <title>Operator Class Interface</title> <para> - The Bloom opclass interface is simple. It requires 1 supporting function: - a hash function for the indexing datatype. It provides 1 search operator: - the equality operator. The example below shows <literal>opclass</> + The Bloom opclass interface is simple as it requires one hash + function for the indexing datatype and it needs one equality operator + as search operator. The example below shows <literal>opclass</> definition for <literal>text</> datatype. </para> @@ -179,22 +174,21 @@ DEFAULT FOR TYPE text USING bloom AS </sect2> <sect2> - <title>Limitation</title> + <title>Limitations</title> <para> - <itemizedlist> <listitem> <para> - For now, only opclasses for <literal>int4</>, <literal>text</> come - with the module. However, users may define more of them. + Only operator classes for <literal>int4</> and <literal>text</> are + implemented with the module. </para> </listitem> <listitem> <para> - Only the <literal>=</literal> operator is supported for search at the - moment. But it's possible to add support for arrays with contains and - intersection operations in the future. + Only the <literal>=</literal> operator is supported for search. But + it is possible to add support for arrays with union and intersection + operations in the future. </para> </listitem> </itemizedlist>
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers