I'm working on a dating/personals/match-making site, that has used many 
different methods of "match-making", that all seem to be very slow. One I am 
attempting now that seems to be an efficient method of storage, but not the 
best for indexing, is using bitwise operators to compare one person's profile 
to another's.

This method seems to be very fast on a small scale, but I am dealing with a 
large user-base here, in excess of 200,000 users that will be executing this 
search function every time they login (the search results of their profile 
will appear on the main page after they have logged in). I've opted to use 
"label tables" for each possible set of answers. (i.e: Marital Status)

For this table, the set of bits -- bit(5) -- are represented as such:

| Bit | Meaning    |
| 1   | single     |
| 2   | separated  |
| 3   | married    |
| 4   | divorced   |
| 5   | widowed    |

Here's the structure of the marital status table:

# \d marital_matrix                                                                    
Table "public.marital_matrix"
  Column   |      Type      |                               Modifiers                  
 member_id | integer        | not null default 
 status    | bit varying(5) | not null default (0)::bit(5)
 p_status  | bit varying(5) | not null default (0)::bit(5)
    "marital_matrix_pkey" PRIMARY KEY, btree (member_id)
    "idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
    "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
    "idx_marital_p_status" btree (p_status)
    "idx_marital_status" btree (status)
Foreign-key constraints:
    "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE 

To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a 
smaller sample than what I will actually be using):

datingsite=> select count(*),status,p_status from marital_matrix group by 
 count | status | p_status 
    89 | 00001  | 00000
  1319 | 00010  | 00000
  2465 | 00100  | 00000
     1 | 00100  | 11111
 46117 | 10000  | 00000

here is the user I'll be comparing against, which has selected that he be 
matched with any but married people:

datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
 member_id | status | p_status 
        21 | 10000  | 11011
(1 row)

Here are a few possible comparison methods I can think of (NOTE: tests were 
run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10:

METHOD 1: Any bit that is set in p_status (prefered marital status) of the 
searching user should be set in the potential match's marital status. This is 
the method I'd like to improve, if possible. Running the query twice didn't 
produce a different runtime.

    marital_matrix m1, marital_matrix m2
    m1.member_id = 21 AND
    m2.status & m1.p_status != B'00000';
                                                                QUERY PLAN             
 Nested Loop  (cost=0.00..2357.79 rows=49742 width=4) (actual 
time=18.062..708.469 rows=47525 loops=1)
   Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <> 
   ->  Index Scan using marital_matrix_pkey on marital_matrix m1  
(cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
         Index Cond: (member_id = 21)
   ->  Seq Scan on marital_matrix m2  (cost=0.00..1602.91 rows=49991 width=13) 
(actual time=17.966..255.529 rows=49991 loops=1)
 Total runtime: 905.694 ms
(6 rows)

METHOD 2: Specifying the value (I don't think this would make a difference, 
but I'll post anyways):

    status & B'11011' != B'00000';

                                                      QUERY PLAN                       
 Seq Scan on marital_matrix  (cost=0.00..1852.87 rows=49742 width=4) (actual 
time=18.113..281.101 rows=47525 loops=1)
   Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
 Total runtime: 480.836 ms
(3 rows)

METHOD 3: Checking for one bit only. This is definitely not a "real world" 
example and unacceptable since the p_status column can and will have multiple 
bits. For categories other than "Marital Status", such as "Prefered Hair 
Color", the users are likely to select multiple bits (they choose all that 
apply). This query does use the index, but is still not very fast at all:

    marital_matrix m1
    status & B'10000' = B'10000';
 Index Scan using idx_marital_matrix_single on marital_matrix m1  
(cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117 
   Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
 Total runtime: 451.162 ms
(3 rows)

METHOD 4: Using an IN statement. This method seems to be very fussy about 
using the index, and I have at some point made it use the index when there 
are less than 3 possibilites. Also, for fields other than Marital Status, 
users will be able to select many bits for their own profile, which means 
there would be many permutations:

    status & B'11011' IN (B'10000',B'01000');
          QUERY PLAN                                                                   
 Seq Scan on marital_matrix  (cost=0.00..2602.73 rows=993 width=4) (actual 
time=17.845..288.279 rows=47525 loops=1)
   Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR 
(((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit" 
& B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit") 
= B'00001'::"bit"))
 Total runtime: 488.651 ms
(3 rows)

Method 3 is the only one that used the index, but the only really acceptable 
method here is Method 1.

My questions are...
- Is there any hope in getting this to use an efficient index?
- Any mathmaticians know if there is a way to reorder my bitwise comparison to 
have the operator use = and not an != (perhaps to force an index)? (AFAIK, 
the answer to the second question is no)

If anyone could offer any performance tips here I'd really appreciate it. I 
imagine that having this schema wouldn't last an hour with the amount of CPU 
cycles it would be consuming on math operations.

Also, I have read the thread that was posted here by Daniel in August:

I have spoke with Daniel on this issue and we both agree it's very difficult 
to find a solution that can scale to very large sites.

I would very much appreciate any advice that some experienced users may have 
to offer me for such a situation. TIA


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Reply via email to